Commit 22f6d271 by nanziyuan

update

parent cd7f6e7f
mini -t "Generate specification for {{code_path}} and save it at {{spec_path}}."
```

The agent trajectory and some runtime information are saved to run.traj.json. Keeping this file is recommended, since it can be used for later training.
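If you keep these trajectories, a small helper can gather them for a training pipeline. A minimal sketch, assuming only that each run.traj.json is valid JSON (its exact schema is not specified here):

```python
import json
from pathlib import Path


def load_trajectories(experiment_root: str):
    """Yield (module_dir, trajectory) pairs for every run.traj.json found.

    No trajectory schema is assumed; files are parsed as plain JSON.
    """
    for traj_path in sorted(Path(experiment_root).rglob("run.traj.json")):
        yield traj_path.parent, json.loads(traj_path.read_text())
```

Each yielded pair keeps the module directory alongside the parsed trajectory, so a sample can be traced back to the task it came from.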
## RealBench
First, follow the README at https://github.com/IPRC-DIP/RealBench to set up the environment and extract the benchmark archive.

Detailed usage instructions are in the docstring at the top of `agent_bench.py`:
```python
#!/usr/bin/env python3
"""
agent_bench.py - Benchmark management tool for RTL agent evaluation.

This script manages the full workflow for evaluating RTL generation agents:
  1. Generate benchmark tasks (mk_bench)
  2. Evaluate the model directly (run) - calls the LLM API and extracts
     Verilog code from the response
  3. Evaluate the agent workflow (agent_run) - uses the mini-swe-agent framework
  4. Collect results (collect)
  5. Verify solutions (evaluate.py)
  6. Pretty-print results (pretty_print)

Subcommands:
  mk_bench      Generate benchmark tasks from project directories.
                Creates a repository structure with verification files for
                each module found in the projects.
                Normal mode: copies documentation to the doc/ subdirectory.
                Pure code mode (--pure-code): copies the module's Verilog
                file instead of documentation (for code-to-code tasks).

                Usage: python agent_bench.py mk_bench --target <dir> [--pure-code]

  run           Test the base reasoning model using litellm.
                Directly calls the LLM API with task prompts to evaluate the
                base model's ability without an agent workflow.
                The model is instructed to wrap Verilog code in ```verilog
                blocks. The script extracts the pure code for the "code"
                field and saves the raw response in the "raw_response" field.

                Usage: python agent_bench.py run <repo> [output] --model <model> [options]

                Options:
                  --base_url <url>     API base URL (default: http://localhost:8000/v1)
                  --api_key <key>      API key for authentication
                  --workers <n>        Number of parallel workers (default: 1)
                  --n_samples <n>      Samples per module (default: 1)
                  --disable_thinking   Disable thinking mode (enabled by default)
                  --timeout <secs>     Request timeout in seconds (default: 1800)

                Example:
                  python agent_bench.py run ./agent/repo ./samples/Qwen3.5 \\
                      --model "openai/Qwen3.5-35B-A3B" \\
                      --base_url "http://localhost:30000/v1" \\
                      --workers 2 --n_samples 8

  agent_run     Run the agent benchmark using mini-swe-agent.
                Uses the mini-swe-agent framework to solve tasks with tool
                use and iterative refinement. Each sample runs in an
                isolated working directory to prevent file conflicts.
                For N samples, creates N isolated copies (work_dir/sample_i/)
                of the repository; each copy runs independently.
                Trajectory files (run.traj.json) are saved in each module
                directory for later analysis.
                The config is copied to work_dir/config.yaml as a backup.

                Usage: python agent_bench.py agent_run <repo> <work_dir> [options]

                Options:
                  -c, --config <path>  Path to the mini_code.yaml config file
                  -j, --workers <n>    Number of parallel workers (default: 4)
                  --n_samples <n>      Number of isolated samples (default: 1)
                  --timeout <secs>     Timeout per task in seconds (default: 1800)
                  --resume             Skip samples with existing trajectory files

                Example:
                  python agent_bench.py agent_run ./agent/repo ./my_experiment \\
                      -c ./agent/mini_code.yaml -j 4 --n_samples 3 --timeout 1800

  collect       Aggregate Verilog results into grouped JSONL files.
                Scans a source directory for module subdirectories and
                collects their .v files into JSONL format, grouped by
                project prefix. Automatically detects the testing directory
                structure (sample_* subdirs) and collects from all samples
                with the correct codeid.

                Usage: python agent_bench.py collect --source <dir> --target <dir>

  pretty_print  Display verification results in a readable table format.
                Reads the JSON file generated by evaluate.py and prints
                formatted statistics, including Pass@1 and Pass@5 metrics.

                Usage: python agent_bench.py pretty_print <results_json>

Quick Start:
  # 1. Generate benchmark tasks (normal mode with docs)
  python agent_bench.py mk_bench --target ./agent/repo

  # 1b. Generate benchmark tasks (pure code mode)
  python agent_bench.py mk_bench --target ./agent/repo --pure-code

  # 2. Evaluate the model directly (base reasoning)
  python agent_bench.py run ./agent/repo ./samples/Qwen3.5 \\
      --model "openai/Qwen3.5-35B-A3B" \\
      --base_url "http://localhost:30000/v1" \\
      --workers 2 --n_samples 8

  # 3. Evaluate the agent workflow (uses mini-swe-agent)
  python agent_bench.py agent_run ./agent/repo ./my_experiment \\
      -c ./agent/mini_code.yaml -j 4 --n_samples 3

  # 4. Collect results (auto-detects sample_* subdirs)
  python agent_bench.py collect --source ./my_experiment --target ./samples/NAME

  # 5. Verify solutions
  python evaluate.py --solution_name NAME --task_level module --num_samples 1

  # 6. View results
  python agent_bench.py pretty_print results/NAME_module_results.json

Environment Variables:
  OPENAI_API_KEY - API authentication key (if not using --api_key)
"""
```
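The extraction step that `run` performs on model responses can be sketched as follows. This is a hypothetical reimplementation, not the script's actual code; the real parser may handle edge cases differently. (The triple-backtick fence marker is built with `chr(96)` so this snippet can sit inside a Markdown code block.)

```python
import re

# "```" written as chr(96) * 3 so the snippet does not close the
# surrounding Markdown fence.
FENCE = chr(96) * 3


def extract_verilog(raw_response: str) -> str:
    # Grab the last fenced verilog block; fall back to the raw text
    # when the model did not use a fence at all.
    pattern = FENCE + r"verilog\s*\n(.*?)" + FENCE
    blocks = re.findall(pattern, raw_response, re.DOTALL)
    return blocks[-1].strip() if blocks else raw_response.strip()


reply = (
    "Here is the module:\n"
    + FENCE + "verilog\n"
    + "module top(input a, output b);\n"
    + "  assign b = a;\n"
    + "endmodule\n"
    + FENCE + "\nDone."
)
print(extract_verilog(reply))  # prints only the module body
```

Taking the last block biases extraction toward the model's final answer when a response contains several code blocks; that is a design choice, not something the source specifies.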
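The grouping that `collect` performs can be pictured with a small sketch. The assumptions here are not confirmed by the source: module directories are taken to be named `<project>_<module>`, with the text before the first underscore used as the project prefix, and the real script's detection logic (sample_* subdirectories, codeid handling) is more involved.

```python
import json
from collections import defaultdict
from pathlib import Path


def collect_verilog(source: Path, target: Path) -> None:
    # Group each module directory's .v files by project prefix and
    # write one JSONL file per project under the target directory.
    groups = defaultdict(list)
    for vfile in sorted(source.glob("*/*.v")):
        module_dir = vfile.parent.name
        project = module_dir.split("_", 1)[0]  # assumed naming scheme
        groups[project].append({"module": module_dir, "code": vfile.read_text()})
    target.mkdir(parents=True, exist_ok=True)
    for project, records in groups.items():
        with open(target / f"{project}.jsonl", "w") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
```

One JSONL record per module keeps the collected code self-describing, so downstream verification does not need the original directory tree.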
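`pretty_print` reports Pass@1 and Pass@5. These are usually computed with the standard unbiased pass@k estimator over n samples of which c are correct; a sketch of that formula, which may differ from what evaluate.py actually implements:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    # If fewer than k samples are incorrect, every size-k subset must
    # contain at least one correct sample, so the probability is 1.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(8, 2, 1))  # 2 correct out of 8 samples -> 0.25
```

Averaging this value over all modules gives the benchmark-level Pass@k figure.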