1. 07 Mar, 2025 3 commits
  2. 06 Mar, 2025 6 commits
  3. 05 Mar, 2025 5 commits
  4. 04 Mar, 2025 5 commits
  5. 03 Mar, 2025 5 commits
    • [feat] Initial support for VLMs, add Qwen2.5VL GRPO example (#386) · b46f55ec
      ## What does this PR do?
      
      This PR migrates the RL-on-VLMs feature from our implementation in
      the [EasyR1](https://github.com/hiyouga/EasyR1) fork back to veRL.
      We have validated this feature with the Qwen2.5-VL 7B model on
      8×H100 GPUs. The configuration and data preprocessing script are
      provided with this PR for easy reproduction.
      
      ## How to reproduce?
      
      1. Download and preprocess the dataset
      
      ```bash
      python3 examples/data_preprocess/geo3k.py --local_dir ~/data/geo3k
      ```
      
      2. Start GRPO training
      
      ```bash
      bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
      ```
      
      ## Dependencies
      
      - vllm>=0.7.3
      - transformers>=4.49.0
      - [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/)
      - [mathruler](https://pypi.org/project/mathruler/)
      
      ## Major Changes
      
      ### New dataflow for multimodal RL
      
      In this PR, we introduce two new concepts in the dataflow:
      `multi_modal_data` and `multi_modal_inputs`. The former holds the
      multi-modal features required by the **rollout** worker (such as
      vLLM), while the latter holds the multi-modal features required by
      the **actor/critic** worker (such as an HF model). They differ
      because the rollout and actor workers have their own data-format
      requirements.
      
      Taking Qwen2-VL + Hugging Face + vLLM as an example, the data
      structures are:
      
      - **multi_modal_data**: {"image": [PIL.Image, PIL.Image, ...]}
      - **multi_modal_inputs**: {"pixel_values": torch.Tensor,
      "image_grid_thw": torch.Tensor}
      
      Both of them are converted to numpy objects and placed in the non-tensor
      batch in DataProto.
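
      For concreteness, here is a minimal sketch (not the exact veRL
      code; the processor call is indicative only) of how these two
      fields can be built and stored as numpy object arrays in the
      non-tensor batch:

      ```python
      import numpy as np
      from PIL import Image

      image = Image.new("RGB", (224, 224))  # placeholder image

      # Consumed by the rollout worker (e.g. vLLM):
      multi_modal_data = {"image": [image]}

      # Consumed by the actor/critic worker (an HF model). For Qwen2-VL the
      # HF processor returns tensors such as pixel_values and image_grid_thw:
      # multi_modal_inputs = dict(processor(images=[image], text=prompt, return_tensors="pt"))

      # Non-tensor entries are carried per sample as dtype=object numpy arrays:
      non_tensor_batch = {
          "multi_modal_data": np.array([multi_modal_data], dtype=object),
      }
      ```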
      
      This design extends easily to other modalities and VLMs because it
      is model-agnostic.
      
      ### Other changes
      
      - Data
        - Support pre-processing the
          [Geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k)
          dataset.
        - Support `config.data.image_key`, which should point to **a list
          of Pillow images** (see the dataset-row sketch after this list).
      
      - Actor/Ref/Critic
        - Support `multi_modal_inputs`.
        - Process position ids to adapt to the m-rope.
      
      - Rollout
        - Update the dtensor weight loader to adapt to the Qwen2-VL
          architecture in vLLM 0.7+.
        - Support `multi_modal_data`.
        - Use `raw_prompt_ids` as the vLLM inputs to **avoid unpadding**
          the input ids.
      
      - Reward Manager
        - Add **mathruler** for more accurate math scores on the
          Geometry3k dataset.
      
      - Models
        - Support calculating the position ids for the m-rope in Qwen2-VL.
        - Support removing padding in flash attention 2 for the m-rope
          (transformers itself **does not support it**).
      
      - Sharding Manager
        - Support all-gathering the non-tensor batch.
      
      - FSDP Workers / Checkpoint Merger
        - Support `AutoModelForVision2Seq` at model initialization (see
          the sketch below).
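
      As a rough illustration of the model-initialization item above,
      here is a hedged sketch of loading a VLM through
      `AutoModelForVision2Seq` (the actual veRL logic may differ; the
      causal-LM fallback is my assumption):

      ```python
      from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForVision2Seq

      model_path = "Qwen/Qwen2.5-VL-7B-Instruct"  # example checkpoint
      config = AutoConfig.from_pretrained(model_path)
      try:
          # VLMs such as Qwen2-VL are registered under the Vision2Seq auto class.
          model = AutoModelForVision2Seq.from_pretrained(model_path, config=config)
      except ValueError:
          # Text-only checkpoints fall back to the causal-LM auto class.
          model = AutoModelForCausalLM.from_pretrained(model_path, config=config)
      ```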
      
      Note: Ulysses parallelism is not complete yet. We will support it
      in the next update.
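
      To make the `config.data.image_key` item above concrete, here is a
      hedged sketch of one preprocessed dataset row (the key names,
      including `images`, are illustrative; the image column is whatever
      `config.data.image_key` names):

      ```python
      from PIL import Image

      # One row of the preprocessed dataset; config.data.image_key names the
      # column holding the list of Pillow images ("images" in this sketch).
      row = {
          "prompt": "How large is the shaded region in the figure?",
          "images": [Image.new("RGB", (448, 448))],  # a list of PIL images
          "ground_truth": "16",  # consumed by the reward function
      }
      ```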
      
      ## Performance
      
      We provide the estimated MFU of the language-model part on H100
      GPUs. These values understate the actual utilization because **we
      did not count the FLOPs of the vision tower**.
      
      - `remove_padding=False`: MFU ~7%
      - `remove_padding=True`: MFU ~20%
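
      For readers unfamiliar with the metric, here is a back-of-the-envelope
      sketch of how such an MFU estimate is commonly derived (the standard
      6·N·T training-FLOPs approximation; veRL's exact accounting may differ,
      and the throughput below is purely illustrative):

      ```python
      def estimate_mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
                       peak_flops_per_gpu: float = 989e12) -> float:
          """MFU = achieved training FLOPs/s over aggregate peak FLOPs/s.

          Uses the common 6 * N * T approximation for forward+backward FLOPs
          of a dense transformer; 989e12 is the H100 SXM BF16 dense peak.
          """
          achieved = 6 * n_params * tokens_per_sec
          return achieved / (n_gpus * peak_flops_per_gpu)

      # Hypothetical throughput for a 7B model on 8 GPUs (illustrative only):
      print(f"{estimate_mfu(7e9, 2.6e4, 8):.1%}")  # ~13.8%
      ```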
      
      The training and test reward score curves are shown below.
      
      
      ![image](https://github.com/user-attachments/assets/ecb9fc27-8591-4c5b-ae4b-4ba77c6e30f9)
      
      ## Who can review?
      
      @vermouth1992 @PeterSH6
      hoshi-hiyouga committed
    • [fix] update yaml file for generation (#445) · a0a4d5fa
      Forgot to update params in generation.yaml (#259).
      BearBiscuit committed
    • megatron: Update megatron-lm to `core_r0.11.0` (#392) · 0cfd548c
      # Support Megatron mcore 0.11
      
      ## Description
      This PR introduces official support for Megatron mcore 0.11 with the
      following updates:
      - Upgraded Megatron to version `core_r0.11.0`
      - Applied compatibility patch `patches/mcore_r0.11.patch`
      - Removed legacy version support for cleaner implementation
      
      Special thanks to @chendong-1998 for:
      - Original Megatron upgrade from 0.4 to 0.6 (#93f6a7e)
      
      ## Compatibility Notes
      The current implementation requires careful handling due to
      dependency conflicts:
      - `megatron-core==0.11.0` requires torch>=2.6
      - `vllm==0.6.3` requires torch==2.4
      
      Installation constraints:
      1. Use vllm's torch dependency (2.4) as the baseline.
      2. Do NOT run `pip install -e .` in the mcore directory (doing so
      upgrades torch to 2.6).
      3. Apply the compatibility patch (`patches/mcore_r0.11.patch`)
      manually after installation.
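
      As a small aid for constraint 2, here is a sanity check one could
      run after installation (my own suggestion, not part of this PR) to
      catch a silently upgraded torch:

      ```python
      import torch
      from packaging.version import Version  # packaging ships with pip/setuptools

      # vllm==0.6.3 pins torch 2.4; megatron-core==0.11.0 wants torch>=2.6.
      assert Version(torch.__version__).release[:2] == (2, 4), (
          f"expected the vllm-pinned torch 2.4.x, found {torch.__version__}; "
          "did `pip install -e .` in the mcore directory upgrade torch?"
      )
      ```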
      
      ## Testing
      ### Test with `verl/examples/ppo_trainer/run_deepseek_megatron.sh`
      
      ![image](https://github.com/user-attachments/assets/e053c9b8-fdd7-47fc-aaeb-42cf85070056)
      
      ---------
      
      Signed-off-by: chendong-1998 <chendong136@huawei.com>
      Co-authored-by: chendong-1998 <chendong136@huawei.com>
      Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
      Co-authored-by: Sion Gao <gaoziyuan19@mails.ucas.ac.cn>
      Yan Bai committed
  6. 02 Mar, 2025 6 commits
  7. 01 Mar, 2025 2 commits
    • fix: 2 typos (#435) · 99fb2dde
      Lumeng Wu committed
    • Update vLLM>=0.7 doc (#432) · cef4c2de
      Because of ongoing changes in vLLM, veRL currently cannot integrate
      directly with the vLLM nightly build. The new DP feature in the
      nightly version can no longer be bypassed by simply adjusting the
      `data_parallel_size` parameter, and resolving this requires further
      investigation.
      
      As a temporary workaround, I recommend a customized installation of
      vLLM if the V1 engine is required. I have updated the relevant
      documentation to reflect this guidance.
      ZSL98 committed
  8. 28 Feb, 2025 3 commits
  9. 27 Feb, 2025 5 commits