Files · bdb50ac333fad7315eee2d009cc98013ce0c1e8a · ZhangXiaoyun / verl

implement REINFORCE++ algorithm (#228) · bdb50ac3

We have implemented the REINFORCE++ algorithm.

To use it, set the parameter
`algorithm.adv_estimator=reinforce_plus_plus`.
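The following is a minimal, hypothetical sketch of how a REINFORCE++-style advantage estimator can work. It is written in plain Python and is not the code in this commit. It makes three assumptions. First, each response has per-token rewards, for example the outcome reward on the last valid token, optionally minus a per-token KL penalty. Second, a 0/1 mask marks valid tokens. Third, a discount factor `gamma` is used. The estimator computes a discounted return-to-go per sequence and then whitens the returns across all valid tokens in the batch, with no critic. The names `discounted_returns`, `whiten_over_batch` and `reinforce_pp_advantages` are illustrative only.

```python
import math
from typing import List

Matrix = List[List[float]]


def discounted_returns(token_rewards: Matrix, mask: List[List[int]], gamma: float = 1.0) -> Matrix:
    """Return-to-go G_t = r_t + gamma * G_{t+1} for each response, computed right to left.

    token_rewards[i][t]: reward at token t of response i (assumed to already include
    any per-token KL penalty). mask[i][t]: 1 for response tokens, 0 for padding.
    """
    all_returns = []
    for rewards, m in zip(token_rewards, mask):
        running = 0.0
        seq_returns = [0.0] * len(rewards)
        for t in reversed(range(len(rewards))):
            if not m[t]:
                running = 0.0  # padding: no return, and nothing carried backward
                continue
            running = rewards[t] + gamma * running
            seq_returns[t] = running
        all_returns.append(seq_returns)
    return all_returns


def whiten_over_batch(values: Matrix, mask: List[List[int]], eps: float = 1e-8) -> Matrix:
    """Shift/scale values to zero mean and unit variance over all valid tokens in the batch."""
    valid = [v for row, m in zip(values, mask) for v, keep in zip(row, m) if keep]
    mean = sum(valid) / len(valid)
    var = sum((v - mean) ** 2 for v in valid) / len(valid)  # population variance (an assumption)
    scale = 1.0 / math.sqrt(var + eps)
    # Padding positions get an advantage of 0 so they contribute nothing to the loss.
    return [
        [(v - mean) * scale if keep else 0.0 for v, keep in zip(row, m)]
        for row, m in zip(values, mask)
    ]


def reinforce_pp_advantages(token_rewards: Matrix, mask: List[List[int]], gamma: float = 1.0) -> Matrix:
    """Hypothetical REINFORCE++-style estimator: discounted returns, then batch-wide whitening."""
    return whiten_over_batch(discounted_returns(token_rewards, mask, gamma), mask)


if __name__ == "__main__":
    # Two responses: the first is rewarded on its last token; the second has one padding slot.
    rewards = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
    mask = [[1, 1, 1], [1, 1, 0]]
    for row in reinforce_pp_advantages(rewards, mask):
        print([round(a, 4) for a in row])
```

In this sketch, normalization uses statistics from the whole batch rather than per-prompt groups as in GRPO. Under these assumptions, that is the main structural difference from a group-baseline estimator.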

Preliminary evaluations were run in the
[Unakar/Logic-RL](https://github.com/Unakar/Logic-RL) project, a
reproduction of DeepSeek-R1-Zero on the 2K Tiny Logic Puzzle Dataset.
The results indicate that our REINFORCE++ implementation achieves
performance and training stability comparable to, and possibly
exceeding, those of PPO and GRPO.

Related issue: #68

committed Feb 09, 2025


Name
.github/workflows
docker
docs
examples
patches
scripts
tests
verl
.gitignore
.readthedocs.yaml
.style.yapf
LICENSE
Notice.txt
README.md
pyproject.toml
requirements.txt
setup.py

README.md