run_deepseek7b_llm_seq_balance.sh
1.64 KB
[algo] feat: support GRPO algorithm (#124) · cd52d8b3
- Implement KL loss and GRPO outcome advantage (adv), and utilize bon rollouts.
- Provide scripts for deepseek and qwen on GSM8k; scripts for other datasets can be added.
- Support seq balance.
- When training with qwen2-7b, the GSM8k score can reach 0.89.
Guangming Sheng committed
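
The commit message names a GRPO outcome advantage without showing it. As a rough illustration only, and not the code from this commit, the sketch below normalizes each rollout's scalar outcome reward against the other rollouts sampled for the same prompt. The function name `grpo_outcome_advantages`, the `group_ids` argument, the `eps` term, and the use of population standard deviation are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_outcome_advantages(rewards, group_ids, eps=1e-6):
    """Group-relative advantages (hypothetical helper, not the repo's API).

    rewards:   one scalar outcome reward per rollout
    group_ids: id of the prompt each rollout was sampled for
    eps:       assumed small constant to avoid division by zero
    """
    # Collect the rewards of all rollouts that share a prompt.
    groups = defaultdict(list)
    for reward, gid in zip(rewards, group_ids):
        groups[gid].append(reward)

    # Per-group mean and population standard deviation (an assumption;
    # an implementation could use the sample standard deviation instead).
    stats = {gid: (mean(rs), pstdev(rs)) for gid, rs in groups.items()}

    # Advantage = (reward - group mean) / (group std + eps).
    return [
        (reward - stats[gid][0]) / (stats[gid][1] + eps)
        for reward, gid in zip(rewards, group_ids)
    ]


if __name__ == "__main__":
    # Two prompts, two rollouts each, with binary correctness rewards.
    print(grpo_outcome_advantages([1.0, 0.0, 1.0, 0.0], [0, 0, 1, 1]))
```

In this sketch a rollout that scores above its group's average gets a positive advantage and one below gets a negative advantage, so no separate value model is needed to form a baseline.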