docs · dapo · ZhangXiaoyun / verl

[feat] Megatron checkpoint support for current Llama and Qwen models (#687) · 5d0a7eaf

# Intro

Support Megatron checkpointing for model weights, optimizer states, and
RNG states through a new abstraction layer, `MegatronCheckpointManager`,
mirroring the checkpoint manager already used for FSDP. Also add
checkpoint tests.
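To illustrate why all three kinds of state are checkpointed together, here is a minimal sketch that uses only the Python standard library. `ToyCheckpointManager`, its method names, and the dict-based states are hypothetical stand-ins and are not the API of `MegatronCheckpointManager`; per the TODOs in this commit message, the real implementation stores model weights with torch.save/load.

```python
import os
import pickle
import random
import tempfile


class ToyCheckpointManager:
    """Hypothetical stand-in; NOT verl's MegatronCheckpointManager."""

    def __init__(self, model_state, optimizer_state):
        # Plain dicts stand in for real model weights and optimizer states.
        self.model_state = model_state
        self.optimizer_state = optimizer_state

    def save_checkpoint(self, path):
        # Bundle model, optimizer, and RNG state into one file.
        payload = {
            "model": self.model_state,
            "optimizer": self.optimizer_state,
            "rng": random.getstate(),
        }
        with open(path, "wb") as f:
            pickle.dump(payload, f)

    def load_checkpoint(self, path):
        # Restore all three pieces so that training resumes deterministically.
        with open(path, "rb") as f:
            payload = pickle.load(f)
        self.model_state = payload["model"]
        self.optimizer_state = payload["optimizer"]
        random.setstate(payload["rng"])


def demo():
    manager = ToyCheckpointManager({"w": [0.1, 0.2]}, {"step": 3})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ckpt.pkl")
        manager.save_checkpoint(path)
        # Random draws made right after the save point.
        expected = [random.random() for _ in range(3)]
        # Simulate further training that mutates the states.
        manager.model_state = {"w": [9.9, 9.9]}
        manager.optimizer_state = {"step": 100}
        manager.load_checkpoint(path)
        # After restoring, the same draws should be reproduced.
        resumed = [random.random() for _ in range(3)]
    return manager.model_state, manager.optimizer_state, expected == resumed
```

Calling `demo()` restores the saved model and optimizer dicts and replays the same random draws, which is the property that restoring RNG state is meant to provide when a run is resumed.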

# Involved Issues and PRs

This resolves issues #682 and #605 and incorporates PRs #510, #634, #368,
and #330. Thanks to @uygnef, @ShareLer, and @caaatch22 for their great
efforts in these contributions.

# TODOs

- [ ] Support the Megatron dist checkpointing mechanism; for now,
torch.save/load is used to store and restore model weights.
- [x] Quick: also store the model in hf format.

---------

Co-authored-by: caaatch22 <mr.liumingjie@gmail.com>
Co-authored-by: Yu Feng <admin@fengyu.org>
Co-authored-by: ShareLer <sharele@163.com>

committed Mar 23, 2025


Name
_static
advance
amd_tutorial
examples
experiment
faq
perf
preparation
start
workers
Makefile
README.md
README_vllm0.7.md
README_vllm0.8.md
conf.py
data.rst
hybrid_flow.rst
index.rst
requirements-docs.txt

README.md