# codev_data_agent · commit 22f6d271

Authored Mar 24, 2026 by nanziyuan (Ziyuan Nan).
Commit message: update
Parent commit: cd7f6e7f
Showing 3 changed files, with 125 additions and 0 deletions:

- agent_bench.py (new file, mode 100644; diff collapsed)
- mini_spec.yaml (diff collapsed)
- readme.md (+125, −0)

readme.md diff:
@@ -54,3 +54,128 @@ mini -t "Generate specification for {{code_path}} and save it at {{spec_path}}."
The agent trajectory and some runtime information are saved to run.traj.json. Keeping these files is recommended; they can be used for later training.
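As a minimal sketch of how those trajectories could be gathered into one training file (illustrative only, not part of this repo; it assumes nothing about the trajectory schema beyond it being valid JSON, and only that each run leaves a `run.traj.json` somewhere under the working directory):

```python
import json
from pathlib import Path

def collect_trajectories(work_dir: str, out_path: str) -> int:
    """Gather every run.traj.json under work_dir into a single JSONL file.

    The trajectory schema is treated as opaque JSON; each output line
    records the source path alongside the raw trajectory.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # e.g. work_dir/sample_0/<module>/run.traj.json (layout may vary)
        for traj in sorted(Path(work_dir).rglob("run.traj.json")):
            record = {
                "source": str(traj),
                "trajectory": json.loads(traj.read_text(encoding="utf-8")),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```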
## RealBench
First, follow the Readme at https://github.com/IPRC-DIP/RealBench to set up the environment and unpack the data.
Detailed usage is documented in the docstring at the top of `agent_bench.py`:
```python
#!/usr/bin/env python3
"""
agent_bench.py - Benchmark management tool for RTL agent evaluation.
This script manages the full workflow for evaluating RTL generation agents:
1. Generate benchmark tasks (mk_bench)
2. Evaluate model directly (run) - Calls LLM API, extracts Verilog code from response
3. Evaluate agent workflow (agent_run) - Uses mini-swe-agent framework
4. Collect results (collect)
5. Verify solutions (evaluate.py)
6. Pretty print results (pretty_print)
Subcommands:
mk_bench Generate benchmark tasks from project directories.
Creates a repository structure with verification files for each
module found in the projects.
Normal mode: Copies documentation to doc/ subdirectory.
Pure code mode (--pure-code): Copies the module's Verilog file
instead of documentation (for code-to-code tasks).
Usage: python agent_bench.py mk_bench --target <dir> [--pure-code]
run Test the base reasoning model using litellm.
Directly calls the LLM API with task prompts to evaluate the
base model's ability without agent workflow.
The model is instructed to wrap Verilog code in ```verilog blocks.
The script extracts the pure code for the "code" field and saves
the raw response in the "raw_response" field.
Usage: python agent_bench.py run <repo> [output] --model <model> [options]
Options:
--base_url <url> API base URL (default: http://localhost:8000/v1)
--api_key <key> API key for authentication
--workers <n> Number of parallel workers (default: 1)
--n_samples <n> Samples per module (default: 1)
--disable_thinking Disable thinking mode (enabled by default)
--timeout <secs> Request timeout in seconds (default: 1800)
Example:
    python agent_bench.py run ./agent/repo ./samples/Qwen3.5 \
        --model "openai/Qwen3.5-35B-A3B" \
        --base_url "http://localhost:30000/v1" \
        --workers 2 --n_samples 8
agent_run Run agent benchmark using mini-swe-agent.
Uses the mini-swe-agent framework to solve tasks with tool use
and iterative refinement. Each sample runs in an isolated
working directory to prevent file conflicts.
For N samples, creates N isolated copies (work_dir/sample_i/)
of the repository. Each copy runs independently.
Trajectory files (run.traj.json) are saved in each module
directory for later analysis.
Config is copied to work_dir/config.yaml as a backup.
Usage: python agent_bench.py agent_run <repo> <work_dir> [options]
Options:
-c, --config <path> Path to mini_code.yaml config file
-j, --workers <n> Number of parallel workers (default: 4)
--n_samples <n> Number of isolated samples (default: 1)
--timeout <secs> Timeout per task in seconds (default: 1800)
--resume Skip samples with existing trajectory files
Example:
    python agent_bench.py agent_run ./agent/repo ./my_experiment \
        -c ./agent/mini_code.yaml -j 4 --n_samples 3 --timeout 1800
collect Aggregate Verilog results into grouped JSONL files.
Scans a source directory for module subdirectories and collects
their .v files into JSONL format, grouped by project prefix.
Automatically detects the testing directory structure (sample_* subdirs)
and collects from all samples, assigning each the correct codeid.
Usage: python agent_bench.py collect --source <dir> --target <dir>
pretty_print Display verification results in a readable table format.
Reads the JSON file generated by evaluate.py and prints
formatted statistics including Pass@1 and Pass@5 metrics.
Usage: python agent_bench.py pretty_print <results_json>
Quick Start:
# 1. Generate benchmark tasks (normal mode with docs)
python agent_bench.py mk_bench --target ./agent/repo
# 1b. Generate benchmark tasks (pure code mode)
python agent_bench.py mk_bench --target ./agent/repo --pure-code
# 2. Evaluate model directly (base reasoning)
python agent_bench.py run ./agent/repo ./samples/Qwen3.5 \
    --model "openai/Qwen3.5-35B-A3B" \
    --base_url "http://localhost:30000/v1" \
    --workers 2 --n_samples 8
# 3. Evaluate agent workflow (uses mini-swe-agent)
python agent_bench.py agent_run ./agent/repo ./my_experiment \
    -c ./agent/mini_code.yaml -j 4 --n_samples 3
# 4. Collect results (auto-detects sample_* subdirs)
python agent_bench.py collect --source ./my_experiment --target ./samples/NAME
# 5. Verify solutions
python evaluate.py --solution_name NAME --task_level module --num_samples 1
# 6. View results
python agent_bench.py pretty_print results/NAME_module_results.json
Environment Variables:
OPENAI_API_KEY - API authentication key (if not using --api_key)
```
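The `run` subcommand above extracts the Verilog wrapped in fenced blocks from each model response. The actual extraction logic lives in `agent_bench.py` and is not reproduced here; a minimal sketch of one reasonable approach (the regex and the choice to keep the last block are assumptions):

```python
import re

# Matches fenced blocks of the form ```verilog ... ``` in a model response.
VERILOG_BLOCK = re.compile(r"```verilog\s*\n(.*?)```", re.DOTALL)

def extract_verilog(raw_response: str) -> str:
    """Return the last verilog-fenced block from a response, or '' if none.

    Taking the last block favors the final answer when a reasoning model
    emits intermediate drafts earlier in the response.
    """
    blocks = VERILOG_BLOCK.findall(raw_response)
    return blocks[-1].strip() if blocks else ""
```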
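`pretty_print` reports Pass@1 and Pass@5. Whether `evaluate.py` computes them this way is not shown here, but the standard unbiased pass@k estimator over n samples of which c pass is 1 − C(n−c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total (c correct), passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```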