Commit 22f6d271 by nanziyuan

update

parent cd7f6e7f
mini -t "Generate specification for {{code_path}} and save it at {{spec_path}}."
```

The agent trajectory and some runtime information are saved to run.traj.json. Keeping this file is recommended, since it can be used for later training.
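If you keep these trajectories, a small helper can gather them for a training pipeline. A minimal sketch, assuming only that each run.traj.json is valid JSON (its exact schema is not specified here):

```python
import json
from pathlib import Path


def load_trajectories(experiment_root: str):
    """Yield (module_dir, trajectory) pairs for every run.traj.json found.

    No trajectory schema is assumed; files are parsed as plain JSON.
    """
    for traj_path in sorted(Path(experiment_root).rglob("run.traj.json")):
        yield traj_path.parent, json.loads(traj_path.read_text())
```

Each yielded pair keeps the module directory alongside the parsed trajectory, so a sample can be traced back to the task it came from.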
## RealBench
First, follow the README at https://github.com/IPRC-DIP/RealBench to set up the environment and extract the benchmark archive.

Detailed usage instructions are in the docstring at the top of `agent_bench.py`:
```python
#!/usr/bin/env python3
"""
agent_bench.py - Benchmark management tool for RTL agent evaluation.

This script manages the full workflow for evaluating RTL generation agents:
  1. Generate benchmark tasks (mk_bench)
  2. Evaluate the model directly (run) - calls the LLM API and extracts
     Verilog code from the response
  3. Evaluate the agent workflow (agent_run) - uses the mini-swe-agent framework
  4. Collect results (collect)
  5. Verify solutions (evaluate.py)
  6. Pretty-print results (pretty_print)

Subcommands:
  mk_bench      Generate benchmark tasks from project directories.
                Creates a repository structure with verification files for
                each module found in the projects.
                Normal mode: copies documentation to the doc/ subdirectory.
                Pure code mode (--pure-code): copies the module's Verilog
                file instead of documentation (for code-to-code tasks).

                Usage: python agent_bench.py mk_bench --target <dir> [--pure-code]

  run           Test the base reasoning model using litellm.
                Directly calls the LLM API with task prompts to evaluate the
                base model's ability without an agent workflow.
                The model is instructed to wrap Verilog code in ```verilog
                blocks. The script extracts the pure code for the "code"
                field and saves the raw response in the "raw_response" field.

                Usage: python agent_bench.py run <repo> [output] --model <model> [options]

                Options:
                  --base_url <url>     API base URL (default: http://localhost:8000/v1)
                  --api_key <key>      API key for authentication
                  --workers <n>        Number of parallel workers (default: 1)
                  --n_samples <n>      Samples per module (default: 1)
                  --disable_thinking   Disable thinking mode (enabled by default)
                  --timeout <secs>     Request timeout in seconds (default: 1800)

                Example:
                  python agent_bench.py run ./agent/repo ./samples/Qwen3.5 \\
                      --model "openai/Qwen3.5-35B-A3B" \\
                      --base_url "http://localhost:30000/v1" \\
                      --workers 2 --n_samples 8

  agent_run     Run the agent benchmark using mini-swe-agent.
                Uses the mini-swe-agent framework to solve tasks with tool
                use and iterative refinement. Each sample runs in an
                isolated working directory to prevent file conflicts.
                For N samples, creates N isolated copies (work_dir/sample_i/)
                of the repository; each copy runs independently.
                Trajectory files (run.traj.json) are saved in each module
                directory for later analysis.
                The config is copied to work_dir/config.yaml as a backup.

                Usage: python agent_bench.py agent_run <repo> <work_dir> [options]

                Options:
                  -c, --config <path>  Path to the mini_code.yaml config file
                  -j, --workers <n>    Number of parallel workers (default: 4)
                  --n_samples <n>      Number of isolated samples (default: 1)
                  --timeout <secs>     Timeout per task in seconds (default: 1800)
                  --resume             Skip samples with existing trajectory files

                Example:
                  python agent_bench.py agent_run ./agent/repo ./my_experiment \\
                      -c ./agent/mini_code.yaml -j 4 --n_samples 3 --timeout 1800

  collect       Aggregate Verilog results into grouped JSONL files.
                Scans a source directory for module subdirectories and
                collects their .v files into JSONL format, grouped by
                project prefix. Automatically detects the testing directory
                structure (sample_* subdirs) and collects from all samples
                with the correct codeid.

                Usage: python agent_bench.py collect --source <dir> --target <dir>

  pretty_print  Display verification results in a readable table format.
                Reads the JSON file generated by evaluate.py and prints
                formatted statistics, including Pass@1 and Pass@5 metrics.

                Usage: python agent_bench.py pretty_print <results_json>

Quick Start:
  # 1. Generate benchmark tasks (normal mode with docs)
  python agent_bench.py mk_bench --target ./agent/repo

  # 1b. Generate benchmark tasks (pure code mode)
  python agent_bench.py mk_bench --target ./agent/repo --pure-code

  # 2. Evaluate the model directly (base reasoning)
  python agent_bench.py run ./agent/repo ./samples/Qwen3.5 \\
      --model "openai/Qwen3.5-35B-A3B" \\
      --base_url "http://localhost:30000/v1" \\
      --workers 2 --n_samples 8

  # 3. Evaluate the agent workflow (uses mini-swe-agent)
  python agent_bench.py agent_run ./agent/repo ./my_experiment \\
      -c ./agent/mini_code.yaml -j 4 --n_samples 3

  # 4. Collect results (auto-detects sample_* subdirs)
  python agent_bench.py collect --source ./my_experiment --target ./samples/NAME

  # 5. Verify solutions
  python evaluate.py --solution_name NAME --task_level module --num_samples 1

  # 6. View results
  python agent_bench.py pretty_print results/NAME_module_results.json

Environment Variables:
  OPENAI_API_KEY - API authentication key (if not using --api_key)
"""
```
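The extraction step that `run` performs on model responses can be sketched as follows. This is a hypothetical reimplementation, not the script's actual code; the real parser may handle edge cases differently. (The triple-backtick fence marker is built with `chr(96)` so this snippet can sit inside a Markdown code block.)

```python
import re

# "```" written as chr(96) * 3 so the snippet does not close the
# surrounding Markdown fence.
FENCE = chr(96) * 3


def extract_verilog(raw_response: str) -> str:
    # Grab the last fenced verilog block; fall back to the raw text
    # when the model did not use a fence at all.
    pattern = FENCE + r"verilog\s*\n(.*?)" + FENCE
    blocks = re.findall(pattern, raw_response, re.DOTALL)
    return blocks[-1].strip() if blocks else raw_response.strip()


reply = (
    "Here is the module:\n"
    + FENCE + "verilog\n"
    + "module top(input a, output b);\n"
    + "  assign b = a;\n"
    + "endmodule\n"
    + FENCE + "\nDone."
)
print(extract_verilog(reply))  # prints only the module body
```

Taking the last block biases extraction toward the model's final answer when a response contains several code blocks; that is a design choice, not something the source specifies.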
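The grouping that `collect` performs can be pictured with a small sketch. The assumptions here are not confirmed by the source: module directories are taken to be named `<project>_<module>`, with the text before the first underscore used as the project prefix, and the real script's detection logic (sample_* subdirectories, codeid handling) is more involved.

```python
import json
from collections import defaultdict
from pathlib import Path


def collect_verilog(source: Path, target: Path) -> None:
    # Group each module directory's .v files by project prefix and
    # write one JSONL file per project under the target directory.
    groups = defaultdict(list)
    for vfile in sorted(source.glob("*/*.v")):
        module_dir = vfile.parent.name
        project = module_dir.split("_", 1)[0]  # assumed naming scheme
        groups[project].append({"module": module_dir, "code": vfile.read_text()})
    target.mkdir(parents=True, exist_ok=True)
    for project, records in groups.items():
        with open(target / f"{project}.jsonl", "w") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
```

One JSONL record per module keeps the collected code self-describing, so downstream verification does not need the original directory tree.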
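`pretty_print` reports Pass@1 and Pass@5. These are usually computed with the standard unbiased pass@k estimator over n samples of which c are correct; a sketch of that formula, which may differ from what evaluate.py actually implements:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    # If fewer than k samples are incorrect, every size-k subset must
    # contain at least one correct sample, so the probability is 1.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(8, 2, 1))  # 2 correct out of 8 samples -> 0.25
```

Averaging this value over all modules gives the benchmark-level Pass@k figure.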