MCPcopy Index your code
hub / github.com/SkyworkAI/DeepResearchAgent / test_leetcode_benchmark

Function test_leetcode_benchmark

tests/run_leetcode_agent.py:248–397  ·  view source on GitHub ↗

Test the benchmark manager specifically for LeetCode using a REAL model. Uses PIPELINE mode: inference and evaluation run in parallel. 流水线模式说明: - 推理完成 batch_size 个任务后,立即 push + 评测 - 同时继续推理下一批任务 - 推理失败的任务直接保存结果,不入队列 - 共享一个浏览器实例

(benchmark_name: str = "leetcode")

Source from the content-addressed store, hash-verified

246
247
248async def test_leetcode_benchmark(benchmark_name: str = "leetcode"):
249 """
250 Test the benchmark manager specifically for LeetCode using a REAL model.
251 Uses PIPELINE mode: inference and evaluation run in parallel.
252
253 流水线模式说明:
254 - 推理完成 batch_size 个任务后,立即 push + 评测
255 - 同时继续推理下一批任务
256 - 推理失败的任务直接保存结果,不入队列
257 - 共享一个浏览器实例
258 """
259 print(f"\n{'='*60}")
260 print(f"🧪 LeetCode Benchmark Test (Pipeline Mode)")
261 print(f"{'='*60}")
262 print(f"🤖 Model: {TARGET_MODEL}")
263 print(f"💻 Language: {TARGET_LANGUAGE}")
264 print(f"⚡ Max concurrent inference: {MAX_CONCURRENT_INFERENCE}")
265 print(f"📦 Batch size for evaluation: {BATCH_SIZE}")
266 print(f"{'='*60}\n")
267
268 # Define save directory
269 save_dir = os.path.join(config.workdir, "benchmark", benchmark_name)
270 if not os.path.exists(save_dir):
271 os.makedirs(save_dir, exist_ok=True)
272 print(f"📁 Created output directory: {save_dir}")
273
274 # 1. Reset and collect all tasks
275 print(f"🔄 Resetting progress for LeetCode...")
276 task = await benchmark_manager.reset(benchmark_name)
277
278 if not task:
279 logger.warning("⚠️ No tasks available to run (Dataset empty or all finished).")
280 summarize_benchmark_results(benchmark_name)
281 return
282
283 # ==========================================
284 # 收集所有待处理任务
285 # ==========================================
286 all_tasks: List[Task] = [task]
287 while True:
288 next_task = await benchmark_manager.step(benchmark_name)
289 if next_task is None:
290 break
291 all_tasks.append(next_task)
292
293 total_tasks = len(all_tasks)
294 print(f"📋 Collected {total_tasks} tasks for processing\n")
295
296 # ==========================================
297 # 获取 benchmark 实例
298 # ==========================================
299 benchmark = await benchmark_manager.get(benchmark_name)
300
301 # ==========================================
302 # 流水线模式:推理完成直接入队,满5个自动评测
303 # ==========================================
304 inference_semaphore = asyncio.Semaphore(MAX_CONCURRENT_INFERENCE)
305

Callers 1

mainFunction · 0.70

Calls 12

printFunction · 0.85
joinMethod · 0.80
warningMethod · 0.80
flush_eval_queueMethod · 0.80
existsMethod · 0.45
resetMethod · 0.45
stepMethod · 0.45
appendMethod · 0.45
getMethod · 0.45
get_queue_sizeMethod · 0.45

Tested by

no test coverage detected