[algo] feat: support GRPO algorithm (#124)
- Implement KL loss and GRPO outcome advantage, and utilize bon rollouts.
- Provide scripts for deepseek and qwen on GSM8k. More can be provided for other datasets.
- Support seq balance.
- Training with qwen2-7b, the GSM8k score can reach 0.89.
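
For readers unfamiliar with the two terms in the description, the following is a minimal sketch of what a GRPO-style outcome advantage and a per-token KL penalty typically look like. It is not the code from this change: the function names `grpo_outcome_advantages` and `kl_penalty`, the `group_ids` argument, and the `eps` value are hypothetical, and the KL form shown is one commonly used low-variance estimator that may differ from the one implemented here.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def grpo_outcome_advantages(rewards, group_ids, eps=1e-6):
    """Normalize each scalar outcome reward against its own rollout group.

    rewards:   one final reward per sampled response
    group_ids: id of the prompt each response was sampled for (hypothetical layout)
    """
    # Collect the rewards of all responses that share a prompt.
    groups = defaultdict(list)
    for reward, gid in zip(rewards, group_ids):
        groups[gid].append(reward)

    # Per-group mean and population standard deviation.
    stats = {gid: (mean(rs), pstdev(rs)) for gid, rs in groups.items()}

    # Advantage = (reward - group mean) / (group std + eps).
    return [
        (reward - stats[gid][0]) / (stats[gid][1] + eps)
        for reward, gid in zip(rewards, group_ids)
    ]


def kl_penalty(logp, ref_logp):
    """Non-negative per-token KL estimate: exp(d) - d - 1 with d = ref_logp - logp."""
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


if __name__ == "__main__":
    # Two prompts, two sampled responses each.
    advantages = grpo_outcome_advantages([1.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
    print(advantages)
    print(kl_penalty(logp=-1.2, ref_logp=-1.0))
```

In this sketch a group whose responses all receive the same reward gets zero advantage for every response, which is why several rollouts per prompt are needed for the signal to be informative.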
Showing
examples/grpo_trainer/run_deepseek7b_llm.sh
0 → 100644
examples/grpo_trainer/run_qwen2-7b.sh
0 → 100644
tests/__init__.py
0 → 100644