Files · 9db52329f67afb0ab779f943c35543d0ca21df2d · ZhangXiaoyun / verl

[misc] feat: support offload parameter and optimizer during rollout (#284) · 9db52329

- Fixed FSDP1 model offload.
- With `actor_rollout_ref.actor.fsdp_config.param_offload=True` and
`actor_rollout_ref.actor.fsdp_config.optimizer_offload=True`, the GPU
memory utilization can be increased to 0.9.
- With actor, critic, and reference offload all enabled, only one model
copy is in GPU memory at a time. Therefore, we can further increase
`micro_batch_size_per_gpu` or `max_token_per_gpu`.
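
As a rough illustration of the two flags above, the following sketch assembles them as `key=value` override strings that could be appended to a training command. The helper name `offload_overrides` is hypothetical and not part of verl; only the two config keys come from this commit message.

```python
# Hypothetical helper (not part of verl): builds the two offload overrides
# named in the commit message as "key=value" strings.
def offload_overrides(param_offload=True, optimizer_offload=True):
    prefix = "actor_rollout_ref.actor.fsdp_config"
    return [
        f"{prefix}.param_offload={param_offload}",
        f"{prefix}.optimizer_offload={optimizer_offload}",
    ]


# Join with a trailing backslash per line, as in a multi-line shell command.
print(" \\\n".join(offload_overrides()))
```

The trailing backslashes are shell line continuations only; they are not part of the config values.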

**Specifically:**
- During rollout, only the rollout model and the KV cache are in GPU memory.
- While the critic computes values, only the critic model stays in GPU
memory; its optimizer and the other model states are in CPU main
memory.
- During the actor update, the actor model and its optimizer are on the GPU,
while the reference model, the critic model, and the critic optimizer are
offloaded to CPU.
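
The three phases above can be summarized as a small placement table. This is only a sketch of the schedule as described in this message, not verl's implementation; the component and phase names are illustrative labels, and `off_gpu` covers anything not resident on the GPU in that phase (offloaded to CPU or not allocated).

```python
from enum import Enum


class Phase(Enum):
    ROLLOUT = "rollout"
    CRITIC_VALUES = "critic_compute_values"
    ACTOR_UPDATE = "actor_update"


# Illustrative component labels (assumed names, not verl identifiers).
ALL_COMPONENTS = {
    "rollout_model", "kv_cache",
    "actor_model", "actor_optimizer",
    "critic_model", "critic_optimizer",
    "reference_model",
}

# GPU-resident components per phase, as stated in the commit message.
GPU_RESIDENT = {
    Phase.ROLLOUT: {"rollout_model", "kv_cache"},
    Phase.CRITIC_VALUES: {"critic_model"},
    Phase.ACTOR_UPDATE: {"actor_model", "actor_optimizer"},
}


def placement(phase):
    """Return a {component: 'gpu' | 'off_gpu'} map for the given phase."""
    on_gpu = GPU_RESIDENT[phase]
    return {
        name: ("gpu" if name in on_gpu else "off_gpu")
        for name in sorted(ALL_COMPONENTS)
    }


for phase in Phase:
    resident = [n for n, where in placement(phase).items() if where == "gpu"]
    print(f"{phase.value}: {', '.join(resident)}")
```

Each phase keeps only the state it needs on the GPU, which is what frees room for a larger `micro_batch_size_per_gpu` or `max_token_per_gpu`.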

committed Feb 17, 2025


Name
.github/workflows
docker
docs
examples
patches
scripts
tests
verl
.gitignore
.readthedocs.yaml
.style.yapf
LICENSE
Notice.txt
README.md
pyproject.toml
requirements.txt
setup.py
