[feat] Initial support for VLMs, add Qwen2.5VL GRPO example (#386)
## What does this PR do?

This PR migrates the RL-on-VLMs feature from our [EasyR1](https://github.com/hiyouga/EasyR1) fork back to veRL. We have validated the feature with the Qwen2.5-VL 7B model on 8×H100 GPUs. The configuration and data-preprocessing script are provided with this PR for easy reproduction.

## How to reproduce?

1. Download and preprocess the dataset:
```bash
python3 examples/data_preprocess/geo3k.py --local_dir ~/data/geo3k
```
2. Start GRPO training:
```bash
bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
```

## Dependencies

- vllm>=0.7.3
- transformers>=4.49.0
- [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/)
- [mathruler](https://pypi.org/project/mathruler/)

## Major Changes

### New dataflow for multimodal RL

This PR introduces two new fields in the dataflow: `multi_modal_data` and `multi_modal_inputs`. The former holds the multimodal features required by the **rollout** worker (such as vLLM), while the latter holds the multimodal features required by the **actor/critic** worker (such as an HF model). They are kept separate because the rollout and actor workers have different data-format requirements. Taking Qwen2-VL + Hugging Face + vLLM as an example, the data structures are:

- **multi_modal_data**: `{"image": [PIL.Image, PIL.Image, ...]}`
- **multi_modal_inputs**: `{"pixel_values": torch.Tensor, "image_grid_thw": torch.Tensor}`

Both are converted to numpy objects and placed in the non-tensor batch of `DataProto`. Because this design is model-agnostic, it can easily be extended to other modalities and VLMs. A minimal sketch of this dataflow is appended at the end of this description.

### Other changes

- Data
  - Support pre-processing the [Geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) dataset.
  - Support `config.data.image_key`, which should point to **a list of Pillow images**.
- Actor/Ref/Critic
  - Support `multi_modal_inputs`.
  - Process position ids to adapt to m-RoPE.
- Rollout
  - Update the dtensor weight loader to adapt to the Qwen2-VL architecture in vLLM 0.7+.
  - Support `multi_modal_data`.
  - Use `raw_prompt_ids` as the vLLM input to **avoid unpadding** the input ids.
- Reward Manager
  - Add **mathruler** for more accurate math scores on the Geometry3k dataset (a scoring sketch is appended at the end of this description).
- Models
  - Support calculating the position ids for m-RoPE in Qwen2-VL.
  - Support removing padding in FlashAttention-2 for m-RoPE (transformers itself **does not support this**).
- Sharding Manager
  - Support all-gathering the non-tensor batch.
- FSDP Workers / Checkpoint Merger
  - Support `AutoModelForVision2Seq` at model initialization.

Note: Ulysses parallelism is not supported yet; we will add it in the next update.

## Performance

We provide the estimated MFU of the language-model part on H100 GPUs. These values underestimate the actual MFU because **we did not count the FLOPs of the vision tower**.

- `remove_padding=False`: MFU ~7%
- `remove_padding=True`: MFU ~20%

The training and test reward score curves are presented in the figure attached to the PR.

## Who can review?

@vermouth1992 @PeterSH6
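## Sketch: building `multi_modal_data` and `multi_modal_inputs`

To make the new dataflow concrete, here is a minimal sketch of how the two fields could be assembled for Qwen2.5-VL and stored as numpy object arrays for the non-tensor batch. This is an illustration under assumptions, not the actual veRL code: the helper `build_sample` and the batch layout at the end are hypothetical, while the processor and `qwen_vl_utils` calls follow standard Qwen2-VL usage.

```python
# Illustrative sketch only; the real veRL dataset/rollout code differs in detail.
import numpy as np
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

def build_sample(prompt: str, image_path: str) -> dict:
    # Hypothetical helper: turn one prompt + image into the two multimodal fields.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)  # list of PIL.Image objects

    # Features consumed by the rollout worker (vLLM).
    multi_modal_data = {"image": image_inputs}

    # Features consumed by the actor/critic worker (HF model).
    model_inputs = processor(text=[text], images=image_inputs, return_tensors="pt")
    multi_modal_inputs = {
        "pixel_values": model_inputs["pixel_values"],
        "image_grid_thw": model_inputs["image_grid_thw"],
    }

    return {
        "raw_prompt_ids": model_inputs["input_ids"][0].tolist(),  # unpadded ids for vLLM
        "multi_modal_data": multi_modal_data,
        "multi_modal_inputs": multi_modal_inputs,
    }

# Both dicts end up in the non-tensor batch as numpy object arrays, e.g.:
samples = [build_sample("Find the measure of angle A.", "geo3k_example.png")]
non_tensor_batch = {
    "multi_modal_data": np.array([s["multi_modal_data"] for s in samples], dtype=object),
    "multi_modal_inputs": np.array([s["multi_modal_inputs"] for s in samples], dtype=object),
}
```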
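## Sketch: Geometry3k scoring with mathruler

The sketch below shows one way the mathruler-based scoring could look. It assumes mathruler exposes `extract_boxed_content` and `grade_answer` in its `grader` module (as used in EasyR1); the function name `geo3k_score` and the 0/1 reward scheme are illustrative, not necessarily the exact implementation in `verl/utils/reward_score/geo3k.py`.

```python
# Illustrative reward function; the shipped geo3k reward may weigh format/accuracy differently.
from mathruler.grader import extract_boxed_content, grade_answer

def geo3k_score(response: str, ground_truth: str) -> float:
    """Return 1.0 if the \\boxed{...} answer in the response matches the ground truth."""
    answer = extract_boxed_content(response)
    return 1.0 if grade_answer(answer, ground_truth) else 0.0
```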
## New files

- `.github/workflows/e2e_vlm_geo3k.yml`
- `examples/data_preprocess/geo3k.py`
- `examples/grpo_trainer/run_qwen2_5_vl-7b.sh`
- `tests/e2e/run_qwen2vl_geo3k_function_rm.sh`
- `verl/models/transformers/qwen2_vl.py`
- `verl/utils/reward_score/geo3k.py`