[ckpt] feat: integrate checkpoint resume in RL ray trainer (#222) · 5a400bf2
**Features:**

- Save actor and critic checkpoints:
  - Model
  - Optimizer
  - lr_scheduler
  - rng_state
  - dataloader
- A checkpoint is complete only when the dataloader, actor, and critic (if any) states have all been saved.
- By default, the dataset itself is not saved; only the dataloader state (including the sampler) is stored.

**Usage:**

- Supported resume modes: `auto`, `disable`, and `resume_from_path`
  - `auto`: veRL automatically looks for the latest checkpoint in `trainer.default_local_dir`
  - `disable`: veRL always trains from scratch
  - `resume_from_path`: when `resume_from_path`=True, the user only needs to set the resume_mode to the path of the checkpoint to load

**TODO:**

- Support SFT resume in the next PR
- Support uploader

**Relevant issues:**

- https://github.com/volcengine/verl/issues/76
- https://github.com/volcengine/verl/issues/143
Guangming Sheng committed
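
The three resume modes described in the commit message can be illustrated with a minimal sketch. This is not veRL's actual implementation: the helper names `find_latest_checkpoint` and `resolve_resume_path`, the separate `checkpoint_path` argument, and the `global_step_<N>` directory naming are all assumptions made for illustration only.

```python
import os
import re
from typing import Optional

# Assumption for this sketch: each checkpoint lives in a sub-directory
# named "global_step_<N>". The commit message does not specify the layout.
_STEP_DIR = re.compile(r"^global_step_(\d+)$")


def find_latest_checkpoint(root: str) -> Optional[str]:
    """Return the checkpoint directory with the highest step, or None."""
    if not os.path.isdir(root):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(root):
        match = _STEP_DIR.match(name)
        path = os.path.join(root, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))  # compare numerically, not as strings
            if step > best_step:
                best_step, best_path = step, path
    return best_path


def resolve_resume_path(
    resume_mode: str,
    default_local_dir: str,
    checkpoint_path: Optional[str] = None,
) -> Optional[str]:
    """Map a resume mode to the checkpoint to load (None = train from scratch)."""
    if resume_mode == "disable":
        return None
    if resume_mode == "auto":
        # Falls back to training from scratch when no checkpoint exists yet.
        return find_latest_checkpoint(default_local_dir)
    if resume_mode == "resume_from_path":
        if not checkpoint_path:
            raise ValueError("resume_from_path requires a checkpoint path")
        return checkpoint_path
    raise ValueError(f"unknown resume mode: {resume_mode!r}")
```

In this sketch, `auto` silently degrades to a fresh run when the directory is empty, whereas an explicit path that is missing raises an error; whether veRL makes the same choice is not stated in the commit message.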
Name | Last commit | Last update |
---|---|---|
test_fsdp_ckpt.py | | |