Name | Last commit | Last update |
---|---|---|
.github/workflows | ||
docker | ||
docs | ||
examples | ||
patches | ||
scripts | ||
tests | ||
verl | ||
.gitignore | ||
.readthedocs.yaml | ||
.style.yapf | ||
LICENSE | ||
Notice.txt | ||
README.md | ||
pyproject.toml | ||
requirements.txt | ||
setup.py |

We have implemented the REINFORCE++ algorithm. To use it, set `algorithm.adv_estimator=reinforce_plus_plus`. A preliminary evaluation was run in the [Unakar/Logic-RL](https://github.com/Unakar/Logic-RL) project, a reproduction of DeepSeek R1 Zero on the 2K Tiny Logic Puzzle Dataset. In that evaluation, our REINFORCE++ implementation showed performance and training stability comparable to, and potentially better than, PPO and GRPO. Related issue: #68
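
For readers unfamiliar with what the `reinforce_plus_plus` estimator computes, the following is a minimal sketch of a REINFORCE++-style advantage calculation: a discounted return-to-go per token, normalized with the mean and standard deviation over all valid tokens in the batch. It is an illustration under stated assumptions, not the code in this repository. The function name `reinforce_pp_advantages`, its arguments, and the use of plain Python lists are hypothetical; the actual implementation may differ in details such as reward shaping and masking.

```python
import math


def reinforce_pp_advantages(token_rewards, response_mask, gamma=1.0, eps=1e-8):
    """Hypothetical sketch of a REINFORCE++-style advantage estimator.

    token_rewards: list of per-token reward lists, one list per response.
    response_mask: same shape, 1 for real response tokens and 0 for padding.
    Returns a list of per-token advantage lists with padding set to 0.
    """
    # Step 1: discounted return-to-go for each token, computed right to left.
    returns = []
    for rewards, mask in zip(token_rewards, response_mask):
        running = 0.0
        row = [0.0] * len(rewards)
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            running *= mask[t]  # padding positions carry no return
            row[t] = running
        returns.append(row)

    # Step 2: normalize using statistics over every valid token in the batch
    # (a global baseline, rather than a per-prompt group or a learned critic).
    valid = [r for row, m in zip(returns, response_mask)
             for r, keep in zip(row, m) if keep]
    if not valid:
        raise ValueError("response_mask selects no tokens")
    mean = sum(valid) / len(valid)
    std = math.sqrt(sum((v - mean) ** 2 for v in valid) / len(valid))

    return [[(r - mean) / (std + eps) * keep for r, keep in zip(row, m)]
            for row, m in zip(returns, response_mask)]


# Two responses: the first earns an outcome reward of 1 on its last token,
# the second earns nothing and has one padding position.
rewards = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
mask = [[1, 1, 1], [1, 1, 0]]
advantages = reinforce_pp_advantages(rewards, mask)
```

In this toy batch the rewarded response receives positive advantages on every token and the unrewarded one receives negative advantages, while the padded position stays at zero. No critic network is involved, which is the main practical difference from PPO.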