Files · 35555d8ae929516843a44454a80159c6aa6caa58 · ZhangXiaoyun / verl

Verl's megatron core_r0.11.0 backend successfully tested with 3D parallelism… · 35555d8a

Verl's megatron core_r0.11.0 backend successfully tested with 3D parallelism with multiple bug fixed (#495)

This PR combines multiple modifications.

# QWen2.5 checkpoint saver bug fix

Thanks for the efforts @uygnef contributed to #368 , we use the new
saver for model loader and saver for 3D parallelism support.

# Megatron backend 3D-parallelism test benches

We modify the scripts in `examples/ppo_trainer` and `tests/e2e`, as well
as the CI workflows, all tested.

# Bug Fix for 3D-parallelism

Including configuration bugs as well as the module packing.

Original TP VocabParallelEntropy can lead to CUDA OOM, we refactor the
implementation with `torch.bmm`.

# Fully migration to Megatron Core

Now we only use Megatron core in verl, fully get rid of calling other
components. If they are in need, please integrate them into
`utils/megatron`.

---------

Co-authored-by: uygnef <admin@fengyu.org>

committed Mar 07, 2025

35555d8a

Name	Last commit	Last update
.github		Loading commit data...
docker		Loading commit data...
docs		Loading commit data...
examples		Loading commit data...
patches		Loading commit data...
scripts		Loading commit data...
tests		Loading commit data...
verl		Loading commit data...
.gitignore		Loading commit data...
.readthedocs.yaml		Loading commit data...
.style.yapf		Loading commit data...
LICENSE		Loading commit data...
Notice.txt		Loading commit data...
README.md		Loading commit data...
pyproject.toml		Loading commit data...
requirements.txt		Loading commit data...
setup.py		Loading commit data...

README.md