tests/checkpoint · v0.2 · ZhangXiaoyun / verl

[ckpt] feat: integrate checkpoint resume in RL ray trainer (#222) · 5a400bf2

**Features:**
- Save actor and critic checkpoints, each containing:
  - Model
  - Optimizer
  - lr_scheduler
  - rng_state
  - dataloader
- A checkpoint counts as complete only when the dataloader, actor and critic
(if any) states have all been saved
- By default, the dataset itself is not saved; only the dataloader (with
sampler) state is stored
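
To illustrate why storing the sampler state can be enough to resume without saving the dataset, here is a minimal standard-library sketch. `ResumableSampler` and its `state_dict` / `load_state_dict` methods are hypothetical names chosen for illustration; they are not taken from this commit, and the actual dataloader state format in veRL may differ.

```python
import random


class ResumableSampler:
    """Hypothetical sampler that yields a seeded permutation of dataset indices."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.seed = seed
        self.position = 0  # number of indices already consumed

    def _order(self):
        # The permutation depends only on (seed, num_samples), so it can be
        # rebuilt on resume and never needs to be written to the checkpoint.
        order = list(range(self.num_samples))
        random.Random(self.seed).shuffle(order)
        return order

    def __iter__(self):
        order = self._order()
        while self.position < self.num_samples:
            index = order[self.position]
            self.position += 1
            yield index

    def state_dict(self):
        # A couple of integers, independent of the dataset size.
        return {"seed": self.seed, "position": self.position}

    def load_state_dict(self, state):
        self.seed = state["seed"]
        self.position = state["position"]


# Consume three indices, snapshot the state, then resume in a fresh sampler.
sampler = ResumableSampler(num_samples=8, seed=42)
iterator = iter(sampler)
first = [next(iterator) for _ in range(3)]
state = sampler.state_dict()

resumed = ResumableSampler(num_samples=8)
resumed.load_state_dict(state)
rest = list(resumed)  # continues where the first sampler stopped
```

In this sketch the resumed sampler visits exactly the indices the first one had not yet produced, which is the property a saved dataloader state is meant to provide.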

**Usage:**
- Supported resume modes: auto, disable and resume_from_path
  - auto: veRL automatically looks for the latest checkpoint in
    `trainer.default_local_dir`
  - disable: veRL always trains from scratch
  - resume_from_path: when `resume_from_path` is set to True, the user only
    needs to set `resume_mode` to the path of the checkpoint to load
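
The three modes above can be sketched as a small resolution function. This is an illustration under stated assumptions, not the trainer's actual code: `resolve_checkpoint` and `find_latest_checkpoint` are hypothetical helpers, the `global_step_<N>` directory naming is assumed for the example, and giving `resume_from_path=True` precedence over the other modes is a guess at the intended behavior.

```python
import os
import re


def find_latest_checkpoint(root):
    """Return the subdirectory of `root` with the highest step number, or None.

    Assumes (hypothetically) that checkpoints live in directories named
    `global_step_<N>`.
    """
    if not os.path.isdir(root):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(root):
        match = re.fullmatch(r"global_step_(\d+)", name)
        path = os.path.join(root, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def resolve_checkpoint(resume_mode, default_local_dir, resume_from_path=False):
    """Return the checkpoint directory to load, or None to train from scratch."""
    if resume_from_path:
        # In this mode `resume_mode` carries the checkpoint path itself.
        return resume_mode
    if resume_mode == "disable":
        return None
    if resume_mode == "auto":
        # Falls back to training from scratch when nothing has been saved yet.
        return find_latest_checkpoint(default_local_dir)
    raise ValueError(f"unknown resume_mode: {resume_mode!r}")
```

For example, `resolve_checkpoint("auto", "checkpoints")` would return the newest `global_step_<N>` directory under `checkpoints`, or `None` if none exists.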

**TODO:**
- Support SFT resume in the next PR
- Support uploader

**Relevant issue:**
- https://github.com/volcengine/verl/issues/76
- https://github.com/volcengine/verl/issues/143

committed Feb 08, 2025

Name
..
test_fsdp_ckpt.py