docs/workers · dapo · ZhangXiaoyun / verl

[feat] Megatron checkpoint support for current Llama and Qwen models (#687) · 5d0a7eaf

# Intro

Support Megatron checkpointing of model weights, optimizer state, and RNG
state through a new abstraction layer, `MegatronCheckpointManager`,
analogous to the one used for FSDP. Also add checkpoint tests.

# Involved Issues and PRs

This resolves issues #682 and #605 and incorporates PRs #510, #634, #368,
and #330. Thanks to @uygnef, @ShareLer and @caaatch22 for their great
efforts on these contributions.

# TODOs

- [ ] Support Megatron's distributed checkpointing mechanism; for now,
  `torch.save`/`torch.load` are used to store and restore model weights.
- [x] Quick: also store the model in Hugging Face (HF) format.
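
For readers unfamiliar with the `torch.save`/`torch.load` approach mentioned above, here is a minimal single-process sketch of saving and restoring model weights, optimizer state, and RNG state together. It is not the `MegatronCheckpointManager` implementation: the helper names `save_checkpoint` and `load_checkpoint`, the dictionary keys, and the toy `Linear` model are assumptions made for illustration, and a real Megatron run would additionally handle per-rank shards and CUDA RNG state.

```python
import os
import tempfile

import torch


def save_checkpoint(path, model, optimizer):
    """Hypothetical helper: write weights, optimizer state, and CPU RNG state."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "rng": torch.get_rng_state(),  # CPU generator only in this sketch
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Hypothetical helper: restore everything written by save_checkpoint."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    torch.set_rng_state(state["rng"])


# Toy model and optimizer; take one step so the optimizer has momentum buffers.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

saved_weight = model.weight.detach().clone()

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "ckpt.pt")
    save_checkpoint(ckpt, model, optimizer)

    # Disturb the state after saving: advance the RNG and zero the weights.
    expected = torch.rand(3)
    with torch.no_grad():
        model.weight.zero_()

    load_checkpoint(ckpt, model, optimizer)
    resumed = torch.rand(3)  # should repeat the draw made right after saving
```

Restoring the RNG state alongside the weights is what lets a resumed run reproduce the same random draws (for example dropout masks or sampling) as the run that was interrupted.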

---------

Co-authored-by: caaatch22 <mr.liumingjie@gmail.com>
Co-authored-by: Yu Feng <admin@fengyu.org>
Co-authored-by: ShareLer <sharele@163.com>

committed Mar 23, 2025


Name
fsdp_workers.rst
megatron_workers.rst
ray_trainer.rst