examples/grpo_trainer/run_deepseek7b_llm_seq_balance.sh · v0.2.0.post1 · ZhangXiaoyun / verl

[algo] feat: support GRPO algorithm (#124) · cd52d8b3

- Implement KL loss and GRPO outcome advantage, and use best-of-n (bon) rollouts
- Provide scripts for deepseek and qwen on GSM8k. Scripts for other
datasets can be added.
- Support sequence balancing (seq balance)
- Trained with qwen2-7b, the GSM8k score can reach 0.89
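
For readers unfamiliar with the "GRPO outcome adv" item, the idea can be sketched as follows: each prompt is sampled several times, every response gets one scalar outcome reward, and the advantage is that reward normalized against the other responses to the same prompt. This is a minimal illustration, not the code added in this commit. The function name `grpo_outcome_advantages`, the `group_ids` argument, the `eps` value, and the use of the population standard deviation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_outcome_advantages(rewards, group_ids, eps=1e-6):
    """Group-normalized outcome advantages (hypothetical helper).

    rewards[i]   -- scalar outcome reward of response i
    group_ids[i] -- id of the prompt that response i was sampled for
    eps          -- assumed small constant to avoid division by zero
    """
    # Collect the rewards of all responses that share a prompt.
    groups = defaultdict(list)
    for reward, gid in zip(rewards, group_ids):
        groups[gid].append(reward)

    # Per-group mean and (population) standard deviation; the choice of
    # population rather than sample std is an assumption of this sketch.
    stats = {gid: (mean(vals), pstdev(vals)) for gid, vals in groups.items()}

    # Advantage = (reward - group mean) / (group std + eps).
    return [
        (reward - stats[gid][0]) / (stats[gid][1] + eps)
        for reward, gid in zip(rewards, group_ids)
    ]


if __name__ == "__main__":
    # Two prompts, four sampled responses each, with 0/1 correctness rewards.
    rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
    group_ids = [0, 0, 0, 0, 1, 1, 1, 1]
    print(grpo_outcome_advantages(rewards, group_ids))
```

In this sketch a group whose responses all receive the same reward yields zero advantage for every response, so such a prompt contributes no policy-gradient signal. In practice the per-response advantage would typically be broadcast over that response's tokens before computing the loss.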

committed Jan 23, 2025


run_deepseek7b_llm_seq_balance.sh 1.64 KB