- 10 Mar, 2025 1 commit
Current bugs when enabling HSDP:

- **Incorrect division in batch sizes**: `ppo_micro_batch`, `ppo_minibatch`, etc. should be divided by `self.device_mesh.size()` instead of `self.device_mesh.shape[0]`.
- **Improper weight initialization in `get_init_weight_context_manager`**:
  - The `get_init_weight_context_manager` function must materialize real weights only on `local_rank == 0` within every FSDP mesh (and use empty/meta weights elsewhere).
  - When `sync_module_states=True`, PyTorch's FSDP first broadcasts parameters within the FSDP process group and then within the DDP process group. If weights are not initialized correctly on `local_rank == 0` of each FSDP mesh, the synchronization may fail or produce incorrect results. https://github.com/pytorch/pytorch/blob/3f069e7679588d5ee4b1d5b2492ca0e20f9320b5/torch/distributed/fsdp/_init_utils.py#L614-L621
  - Ensure initialization occurs only when `self.device_mesh.get_coordinate()[-1] == 0`, which corresponds to `local_rank == 0` within each FSDP mesh (see the sketch below).
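A minimal sketch of the proposed fix, assuming `get_init_weight_context_manager` wraps `accelerate`'s `init_empty_weights` (the signature here is illustrative, not verl's exact API):

```python
from contextlib import nullcontext
from accelerate import init_empty_weights

def get_init_weight_context_manager(use_meta_tensor=True, mesh=None):
    # Fix sketch: under HSDP the device mesh is (ddp, fsdp), and
    # sync_module_states=True broadcasts within the FSDP process group first,
    # then within the DDP process group. Real weights must therefore exist on
    # local_rank == 0 of *every* FSDP mesh, not just on global rank 0.
    if not use_meta_tensor:
        return nullcontext
    if mesh is None:
        import torch.distributed as dist
        materialize = dist.get_rank() == 0  # single-mesh fallback
    else:
        # get_coordinate()[-1] == 0  <=>  local_rank == 0 within this FSDP mesh
        materialize = mesh.get_coordinate()[-1] == 0
    return nullcontext if materialize else init_empty_weights
```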
zhr2001 committed
-
- 08 Mar, 2025 2 commits
Haosheng Zou (邹昊晟) committed
-
Lumeng Wu committed
-
- 07 Mar, 2025 8 commits
- [x] Add concurrency to workflows to cancel previous runs when a new commit is pushed to the same branch.
- [ ] Cancel all workflows/jobs from the same commit if any fails? (Not sure whether we really need this.)

Note: we leave out `secrets_scan.yml` and `scorecard.yml` to avoid any possible leakage or security risk; these workflows also cost little to run.
Shawn/Yuxuan Tong committed -
Searching for an appropriate simplification path can cause `sympy.simplify` to run indefinitely, and the `ProcessPool` may get stuck under excessive concurrency, so the timeout mechanism in `verl/verl/workers/reward_manager/prime.py` cannot catch the hang. To address this, a timeout detection mechanism for `sympy.simplify` is added directly in `verl/verl/utils/reward_score/prime_math/__init__.py`, as sketched below.
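A minimal sketch of such a mechanism, assuming a POSIX `SIGALRM`-based guard (the actual implementation in `prime_math/__init__.py` may differ):

```python
import signal
from contextlib import contextmanager

import sympy

@contextmanager
def time_limit(seconds: int):
    # POSIX-only sketch: interrupt sympy.simplify from inside the worker
    # process instead of relying on the pool-level timeout that can get stuck.
    def handler(signum, frame):
        raise TimeoutError(f"simplify exceeded {seconds}s")
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old_handler)

def simplify_with_timeout(expr, seconds: int = 5):
    try:
        with time_limit(seconds):
            return sympy.simplify(expr)
    except TimeoutError:
        return expr  # fall back to the unsimplified expression
```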
Yuchen Zhang committed -
# Background

In RLHFDataset, we filter out prompts that are too long. This requires applying apply_chat_template to the whole dataset, which is not scalable when the dataset is large. https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L132

Instead of performing this filtering online, we probably want to move the process offline and either add an assertion to avoid truncation or simply perform truncation. Reference: #502

# Key Changes

- Add an option `data.filter_overlong_prompts=True` to enable the above data filtering. The default value is False, but we enable it for all the example scripts.
- Add an option `data.truncation` to truncate the input_ids or prompt length if they exceed max_prompt_length. The default is 'error', which does not allow max_prompt_length to be exceeded; users should increase max_prompt_length if the error is thrown. You can also set it to 'left' or 'right' (see the sketch below).

### Suggestion for large-scale datasets

For large-scale datasets, filtering overlong prompts could be time-consuming. You should set `data.filter_overlong_prompts=False` and set `data.truncation='left'`. Also, please note that you should increase `data.max_prompt_length` to avoid over-truncation of the prompts.
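A hedged sketch of the `data.truncation` semantics described above (names are illustrative and may not match `rl_dataset.py` exactly):

```python
def truncate_prompt(input_ids, max_prompt_length: int, truncation: str = "error"):
    # Sketch of data.truncation: 'left' keeps the end of the prompt,
    # 'right' keeps the start, 'error' refuses overlong prompts.
    if len(input_ids) <= max_prompt_length:
        return input_ids
    if truncation == "left":
        return input_ids[-max_prompt_length:]
    if truncation == "right":
        return input_ids[:max_prompt_length]
    raise RuntimeError(
        f"Prompt length {len(input_ids)} exceeds max_prompt_length "
        f"{max_prompt_length}; increase data.max_prompt_length or set "
        "data.truncation to 'left' or 'right'."
    )
```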
Guangming Sheng committed -
zhou fan committed
-
close #503
Joel committed -
Verl's megatron core_r0.11.0 backend successfully tested with 3D parallelism, with multiple bugs fixed (#495). This PR combines multiple modifications.

# QWen2.5 checkpoint saver bug fix

Thanks to @uygnef for the efforts contributed in #368; we use the new saver as the model loader and saver for 3D-parallelism support.

# Megatron backend 3D-parallelism test benches

We modified the scripts in `examples/ppo_trainer` and `tests/e2e`, as well as the CI workflows; all are tested.

# Bug fixes for 3D parallelism

Including configuration bugs as well as module packing. The original TP VocabParallelEntropy can lead to CUDA OOM; we refactored the implementation with `torch.bmm` (the underlying identity is sketched below).

# Full migration to Megatron Core

Now we only use Megatron Core in verl and no longer call other Megatron components. If they are needed, please integrate them into `utils/megatron`.

---------

Co-authored-by: uygnef <admin@fengyu.org>
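For reference, the memory-friendly entropy identity behind that refactor is H = logsumexp(logits) − Σ softmax(logits)·logits. A hedged sketch that chunks the computation (this illustrates the idea only, not verl's exact `torch.bmm`-based kernel):

```python
import torch
import torch.nn.functional as F

def entropy_from_logits_chunked(logits: torch.Tensor, chunk_size: int = 1024):
    # H = logsumexp(logits) - sum(softmax(logits) * logits), computed in
    # chunks along the token dimension to avoid materializing a full softmax
    # over the whole batch at once.
    entropies = []
    for part in logits.split(chunk_size, dim=0):
        probs = F.softmax(part, dim=-1)
        ent = torch.logsumexp(part, dim=-1) - torch.sum(probs * part, dim=-1)
        entropies.append(ent)
    return torch.cat(entropies, dim=0)
```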
Blue Space committed -
Willem Jiang committed
-
Joel committed
-
- 06 Mar, 2025 6 commits
This PR solves the following 2 problems.

1. **Last step skipped**

   Executing `self.global_steps += 1` before the `if self.global_steps >= self.total_training_steps` check makes the last step skipped. We start from step 1 and expect `self.total_training_steps` steps in total. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L999-L1001

   When `self.global_steps == self.total_training_steps - 1`:
   * we have only executed `self.total_training_steps - 1` steps;
   * `self.global_steps` is updated to `self.total_training_steps`;
   * `self.global_steps >= self.total_training_steps` is satisfied, and training ends.

   Therefore, we should put `self.global_steps += 1` last (see the sketch below).

2. **Redundant validation and logging**

   If `self.total_training_steps % self.config.trainer.test_freq == 0`:
   * `self._validate()` will be executed twice:
     1. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L984
     2. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L1005
   * logging will also be executed twice:
     1. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L985 and https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L997
     2. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L1007
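A runnable toy of the corrected control flow (illustrative only, not verl's actual trainer loop):

```python
def train_one_step(step: int) -> None:
    print(f"training step {step}")

def validate_and_log_final_step() -> None:
    print("final validation + logging (runs exactly once)")

total_training_steps = 3
global_steps = 1  # we start from step 1
while True:
    train_one_step(global_steps)
    if global_steps >= total_training_steps:
        validate_and_log_final_step()
        break
    global_steps += 1  # incremented *after* the termination check
```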
Lumeng Wu committed -
- Add an allgather method to DataProto
- Add tests
- Replace the existing raw allgather with this function
Chi Zhang committed -
Yusheng (Ethan) Su committed
-
In this PR, a `val_generations_to_log_to_swanlab` parameter has been added. When this parameter is set to 1, the generated text from validation is logged to SwanLab. @hiyouga

---

This pull request introduces logging of validation generations to SwanLab in addition to Wandb. The changes include updates to several configuration files and the addition of a new logging method in `ray_trainer.py`. Key changes include:

### Configuration Updates:

* Added the `val_generations_to_log_to_swanlab` parameter to the `trainer` section in the following configuration files:
  * `examples/split_placement/config/ppo_trainer_split.yaml`
  * `verl/trainer/config/ppo_megatron_trainer.yaml`
  * `verl/trainer/config/ppo_trainer.yaml`

### Code Updates:

* Added a new method `_maybe_log_val_generations_to_swanlab` to log validation samples to SwanLab in `verl/trainer/ppo/ray_trainer.py` (sketched below)
* Updated the `_validate` method to call the new SwanLab logging method in `verl/trainer/ppo/ray_trainer.py`
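A minimal sketch of what the new hook might look like, assuming SwanLab's `swanlab.Text` and `swanlab.log` APIs (the actual method in `verl/trainer/ppo/ray_trainer.py` may differ in details):

```python
import swanlab

def _maybe_log_val_generations_to_swanlab(self, inputs, outputs, scores):
    """Log up to N validation samples to SwanLab as text records."""
    n = self.config.trainer.get("val_generations_to_log_to_swanlab", 0)
    if n == 0:
        return
    samples = list(zip(inputs, outputs, scores))
    texts = [
        swanlab.Text(f"input: {i}\n\noutput: {o}\n\nscore: {s}")
        for i, o, s in samples[:n]
    ]
    swanlab.log({"val/generations": texts}, step=self.global_steps)
```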
Ze-Yi LIN committed -
### What does this PR do?

In the `naive` mode, passing `extra_info` for reward function calculation is supported (https://github.com/volcengine/verl/pull/266), but this support is missing in the `prime` mode. This causes reward functions that use `extra_info` to produce incorrect results in `prime` mode. This commit fixes the issue.

### Who can review?

@PeterSH6 @vermouth1992 @hiyouga or other people who have the authority.
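Conceptually, the fix threads `extra_info` through the prime reward path the same way the naive manager already does. A toy sketch (the function body and fields are illustrative, not verl's actual reward code):

```python
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # Toy reward fn that needs extra_info (e.g., per-sample metadata).
    bonus = 0.1 if extra_info and extra_info.get("has_think_tag") else 0.0
    return float(solution_str.strip() == ground_truth.strip()) + bonus

# The prime path now forwards extra_info, matching the naive manager:
score = compute_score(
    data_source="openai/gsm8k",
    solution_str="42",
    ground_truth="42",
    extra_info={"has_think_tag": True},  # previously dropped in prime mode
)
```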
nomadlx committed -
Set timeout in CI to avoid infinite hang. close #468
Chi Zhang committed
-
- 05 Mar, 2025 5 commits
Chi Zhang committed
-
This pull request updates `docs/examples/config.rst` to enhance the documentation for the `Trainer` configuration. The most important changes expand the documented support for various logging platforms.

Documentation updates:

* [`docs/examples/config.rst`](diffhunk://#diff-f051f6df5187cb4805be686b3d10c480877a01e9a35ed98cd63cf8da6af03772L352-R354): Updated the descriptions for `trainer.project_name`, `trainer.experiment_name`, and `trainer.logger` to include support for additional logging platforms such as swanlab, mlflow, and tensorboard.
Ze-Yi LIN committed -
Add support for downloading models from ModelScope by setting `VERL_USE_MODELSCOPE=True`. --------- Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>
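A hedged sketch of how such a switch is typically wired, assuming ModelScope's `snapshot_download` (the helper name is illustrative; verl's actual call site may differ):

```python
import os

def resolve_model_path(model_name_or_path: str) -> str:
    # Illustrative helper: when VERL_USE_MODELSCOPE=True, fetch the model
    # from ModelScope instead of the HuggingFace hub.
    if os.environ.get("VERL_USE_MODELSCOPE", "False").lower() == "true":
        from modelscope import snapshot_download  # pip install modelscope
        return snapshot_download(model_name_or_path)
    return model_name_or_path  # fall back to HF hub name / local path
```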
Hong Zhang committed -
HL committed
-
Calculate MFU in `update_actor`/`update_critic` when using Megatron workers.
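For reference, MFU is the achieved FLOPs throughput divided by the aggregate peak throughput of the GPUs. A minimal sketch (the 989 TFLOPS default is an H100 BF16 assumption, and the helper name is illustrative):

```python
def compute_mfu(flops_per_step: float, step_time_s: float, num_gpus: int,
                peak_flops_per_gpu: float = 989e12) -> float:
    """MFU = achieved FLOPs/s divided by aggregate peak FLOPs/s."""
    achieved_flops_per_s = flops_per_step / step_time_s
    return achieved_flops_per_s / (num_gpus * peak_flops_per_gpu)

# e.g. 6 PFLOPs of work in 2 s on 8 GPUs -> ~0.379 MFU
print(compute_mfu(flops_per_step=6e15, step_time_s=2.0, num_gpus=8))
```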
Mingjie LIU committed
-
- 04 Mar, 2025 5 commits
hoshi-hiyouga committed
-
This PR is a continuing work of #448 , in order to support e2e CI for Ascend NPU.
Shuqiao Li committed -
Add DeepRetrieval to README. Awesome work!
Patrick Jiang committed -
## What does this PR do?

1. Separate the prompt part and the response part in the reward manager, to avoid the format reward leaking from text in the prompt (see the sketch below).
2. Update the reward score function for the Geometry3k dataset.
3. Update the content of the readme file.

## Who can review?

@vermouth1992 @PeterSH6
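A hedged sketch of the separation in item 1, assuming verl's usual layout of a fixed-width, left-padded prompt block followed by the response (tensor names here are illustrative):

```python
import torch

# Toy batch: prompts are left-padded to a fixed width, responses follow.
prompt_ids = torch.tensor([[0, 0, 101, 102]])    # width 4, left-padded
response_ids = torch.tensor([[201, 202, 0, 0]])  # right-padded
attention_mask = torch.tensor([[0, 0, 1, 1, 1, 1, 0, 0]])

prompt_length = prompt_ids.shape[-1]
valid_response_length = int(attention_mask[0, prompt_length:].sum())
response_only = response_ids[0][:valid_response_length]  # tensor([201, 202])
# Decode and score `response_only` alone, so a format reward can never match
# instructions or tags that appear in the prompt.
```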
hoshi-hiyouga committed -
Add ReSearch to README. Awesome work!
Mingyang Chen committed
-
- 03 Mar, 2025 5 commits
Shuqiao Li committed
-
## What does this PR do?

This PR migrates the RL-on-VLMs feature from our implementation in the [EasyR1](https://github.com/hiyouga/EasyR1) fork back to veRL. We have validated this feature using the Qwen2.5-VL 7B model on 8*H100 GPUs. The configuration and data processing script are provided along with this PR for easy reproduction.

## How to reproduce?

1. Download and preprocess the dataset
   ```bash
   python3 examples/data_preprocess/geo3k.py --local_dir ~/data/geo3k
   ```
2. Start GRPO training
   ```bash
   bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
   ```

## Dependencies

- vllm>=0.7.3
- transformers>=4.49.0
- [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/)
- [mathruler](https://pypi.org/project/mathruler/)

## Major Changes

### New dataflow for multimodal RL

In this PR, we introduce two new concepts in the dataflow: `multi_modal_data` and `multi_modal_inputs`. The former holds the multi-modal features required by the **rollout** worker (such as vLLM), while the latter holds the multi-modal features required by the **actor/critic** worker (such as an HF model). They differ because the rollout and actor workers have their own data format requirements.

Taking Qwen2-VL + huggingface + vLLM as an example, the data structure should be:

- **multi_modal_data**: {"image": [PIL.Image, PIL.Image, ...]}
- **multi_modal_inputs**: {"pixel_values": torch.Tensor, "image_grid_thw": torch.Tensor}

Both are converted to numpy objects and placed in the non-tensor batch in DataProto. This design can be extended to other modalities/VLMs easily because it is model-agnostic (a hedged sketch of the two structures appears at the end of this note).

### Other changes

- Data
  - Support pre-processing the [Geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) dataset.
  - Support `config.data.image_key`; the referenced field should contain **a list of Pillow images**.
- Actor/Ref/Critic
  - Support `multi_modal_inputs`.
  - Process position ids to adapt to the m-rope.
- Rollout
  - Update the dtensor weight loader to adapt to the Qwen2-VL architecture in vLLM 0.7+.
  - Support `multi_modal_data`.
  - Use `raw_prompt_ids` as the vLLM inputs to **avoid unpadding** the input ids.
- Reward Manager
  - Add **mathruler** for more accurate math scores on the Geometry3k dataset.
- Models
  - Support calculating the position ids for the m-rope in Qwen2-VL.
  - Support removing padding in flash attention2 for the m-rope (transformers itself **does not support it**).
- Sharding Manager
  - Support all-gathering the non-tensor batch.
- FSDP Workers / Checkpoint Merger
  - Support `AutoModelForVision2Seq` at model initialization.

Note: Ulysses parallelism is not completed yet. We will support it in the next update.

## Performance

We provide the estimated MFU of the language model part for H100 GPUs. These values are lower than the actual ones because **we did not compute the FLOPs of the vision tower part**.

- `remove_padding=False`: MFU ~7%
- `remove_padding=True`: MFU ~20%

The training and test reward score curves are presented as follows.



## Who can review?

@vermouth1992 @PeterSH6
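A hedged sketch of how the two structures relate for Qwen2-VL (the model id, prompt format, and image path below are illustrative; verl's actual preprocessing lives in its dataset and worker code):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
image = Image.open("example.png")  # placeholder image path

# multi_modal_data: raw features consumed by the rollout worker (vLLM).
multi_modal_data = {"image": [image]}

# multi_modal_inputs: processed features consumed by the actor/critic (HF model).
encoded = processor(
    text=["<|vision_start|><|image_pad|><|vision_end|>Describe the image."],
    images=[image],
    return_tensors="pt",
)
multi_modal_inputs = {
    "pixel_values": encoded["pixel_values"],
    "image_grid_thw": encoded["image_grid_thw"],
}
```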
hoshi-hiyouga committed -
Forgot to update params in generation.yaml (#259).
BearBiscuit committed -
# Support Megatron mcore 0.11

## Description

This PR introduces official support for Megatron mcore 0.11 with the following updates:

- Upgraded Megatron to version `core_r0.11.0`
- Applied compatibility patch `patches/mcore_r0.11.patch`
- Removed legacy version support for a cleaner implementation

Special thanks to @chendong-1998 for the original Megatron upgrade from 0.4 to 0.6 (#93f6a7e).

## Compatibility Notes

The current implementation requires careful handling due to dependency conflicts:

- `megatron-core==0.11.0` requires torch>=2.6
- `vllm==0.6.3` requires torch==2.4

Installation constraints:

1. Must use vllm's torch dependency (2.4) as the baseline
2. Do NOT run `pip install -e .` in the mcore directory (it will upgrade torch to 2.6)
3. Apply the compatibility patch manually after installation

## Testing

Tested with `verl/examples/ppo_trainer/run_deepseek_megatron.sh`:



---------

Signed-off-by: chendong-1998 <chendong136@huawei.com>
Co-authored-by: chendong-1998 <chendong136@huawei.com>
Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
Co-authored-by: Sion Gao <gaoziyuan19@mails.ucas.ac.cn>
Yan Bai committed -
HL committed
-
- 02 Mar, 2025 6 commits
Reverts volcengine/verl#314
Chi Zhang committed -
Weizhe Chen committed
-
ZSL98 committed
-
Specify the IP address when calling the bind method.
Willem Jiang committed -
Guangming Sheng committed
-
Now APIs can be displayed: 
HL committed
-
- 01 Mar, 2025 2 commits
Lumeng Wu committed
-
Owing to the ongoing updates in vLLM, veRL currently cannot integrate directly with the nightly build of vLLM. The new DP feature in the nightly version can no longer be bypassed by simply adjusting the `data_parallel_size` parameter, and resolving this requires further investigation. As a temporary workaround, I recommend a customized installation of vLLM if the V1 engine is required; I have updated the relevant documentation to reflect this guidance.
ZSL98 committed
-