- 05 Mar, 2025 3 commits
Add support for downloading models from ModelScope by setting `VERL_USE_MODELSCOPE=True`. Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>
Hong Zhang committed
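A minimal sketch of using the new switch, assuming ModelScope's public `snapshot_download` entry point (the model name below is only an example, not from this commit):

```python
import os

# Opt in to ModelScope downloads before any model loading happens.
os.environ["VERL_USE_MODELSCOPE"] = "True"

if os.environ.get("VERL_USE_MODELSCOPE", "False").lower() == "true":
    # ModelScope mirrors many Hugging Face repos; snapshot_download
    # returns a local directory usable wherever an HF path is expected.
    from modelscope import snapshot_download  # pip install modelscope

    model_path = snapshot_download("Qwen/Qwen2.5-7B-Instruct")
else:
    model_path = "Qwen/Qwen2.5-7B-Instruct"  # resolved via the Hugging Face hub
```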
HL committed
Calculate MFU in update_actor/update_critic when using Megatron workers.
Mingjie LIU committed
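A back-of-the-envelope sketch of an MFU estimate (the exact FLOPs accounting inside the Megatron workers is not shown here; the 6·N·T approximation and the H100 peak figure are assumptions):

```python
def estimate_mfu(num_params: float, tokens_per_step: float,
                 step_time_s: float, num_gpus: int,
                 peak_flops_per_gpu: float = 989e12) -> float:
    """Rough MFU = achieved FLOPs / peak FLOPs, using the common
    ~6 FLOPs per parameter per token forward+backward approximation."""
    achieved = 6.0 * num_params * tokens_per_step / step_time_s
    return achieved / (peak_flops_per_gpu * num_gpus)

# e.g. a 7B model, 65536 tokens/step, 2.1 s/step on 8 GPUs -> ~17%
print(f"MFU ~ {estimate_mfu(7e9, 65536, 2.1, 8):.1%}")
```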
- 04 Mar, 2025 5 commits
hoshi-hiyouga committed
This PR continues the work of #448, in order to support e2e CI for Ascend NPU.
Shuqiao Li committed
Add DeepRetrieval to README. Awesome work!
Patrick Jiang committed
## What does this PR do?

1. Separate the prompt part and the response part in the reward manager to avoid reward leakage from the format reward (a sketch follows this entry).
2. Update the reward score function for the Geometry3k dataset.
3. Update the content in the README file.

## Who can review?

@vermouth1992 @PeterSH6
hoshi-hiyouga committed
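A minimal sketch of the leakage fix in item 1 (function names are illustrative, not verl's actual reward-manager API): score the format on the decoded response only, so format tags present in the prompt cannot earn reward:

```python
def split_prompt_response(tokenizer, input_ids, prompt_len: int):
    """Decode prompt and response separately so reward functions
    only ever see the model-generated response text."""
    prompt_text = tokenizer.decode(input_ids[:prompt_len], skip_special_tokens=True)
    response_text = tokenizer.decode(input_ids[prompt_len:], skip_special_tokens=True)
    return prompt_text, response_text

def format_reward(response_text: str) -> float:
    # Checking the full sequence would also match tags that appear in
    # the prompt (few-shot examples, system instructions, ...).
    return 1.0 if "<answer>" in response_text and "</answer>" in response_text else 0.0
```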
Add ReSearch to README. Awesome work!
Mingyang Chen committed
- 03 Mar, 2025 5 commits
Shuqiao Li committed
## What does this PR do?

This PR migrates the RL-on-VLMs feature from our [EasyR1](https://github.com/hiyouga/EasyR1) fork back to veRL. We have validated this feature using the Qwen2.5-VL 7B model on 8×H100 GPUs. The configuration and data processing script are provided with this PR for easy reproduction.

## How to reproduce?

1. Download and preprocess the dataset
```bash
python3 examples/data_preprocess/geo3k.py --local_dir ~/data/geo3k
```
2. Start GRPO training
```bash
bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
```

## Dependencies

- vllm>=0.7.3
- transformers>=4.49.0
- [qwen-vl-utils](https://pypi.org/project/qwen-vl-utils/)
- [mathruler](https://pypi.org/project/mathruler/)

## Major Changes

### New dataflow for multimodal RL

In this PR, we introduce two new concepts in the dataflow: `multi_modal_data` and `multi_modal_inputs`. The former holds the multi-modal features required by the **rollout** worker (such as vLLM), while the latter holds the multi-modal features required by the **actor/critic** worker (such as an HF model). They differ because the rollout and actor workers have their own data format requirements. Taking Qwen2-VL + Hugging Face + vLLM as an example, the data structures are:

- **multi_modal_data**: {"image": [PIL.Image, PIL.Image, ...]}
- **multi_modal_inputs**: {"pixel_values": torch.Tensor, "image_grid_thw": torch.Tensor}

Both are converted to numpy objects and placed in the non-tensor batch in DataProto. Because this design is model-agnostic, it can easily be extended to other modalities/VLMs (see the sketch after this entry).

### Other changes

- Data
  - Support pre-processing the [Geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) dataset.
  - Support `config.data.image_key`, which should be **a list of Pillow images**.
- Actor/Ref/Critic
  - Support `multi_modal_inputs`.
  - Process position ids to adapt to the m-rope.
- Rollout
  - Update the dtensor weight loader to adapt to the Qwen2-VL architecture in vLLM 0.7+.
  - Support `multi_modal_data`.
  - Use `raw_prompt_ids` as the vLLM inputs to **avoid unpadding** the input ids.
- Reward Manager
  - Add **mathruler** for more accurate math scores on the Geometry3k dataset.
- Models
  - Support calculating the position ids for the m-rope in Qwen2-VL.
  - Support removing padding in flash attention 2 for the m-rope (transformers itself **does not support it**).
- Sharding Manager
  - Support all-gathering the non-tensor batch.
- FSDP Workers / Checkpoint Merger
  - Support `AutoModelForVision2Seq` at model initialization.

Note: Ulysses parallelism is not completed yet. We will support it in the next update.

## Performance

We provide the estimated MFU of the language model part for H100 GPUs. These values are lower than the actual ones because **we did not compute the FLOPs of the vision tower part**.

- `remove_padding=False`: MFU ~7%
- `remove_padding=True`: MFU ~20%

[Figure: training and test reward score curves]

## Who can review?

@vermouth1992 @PeterSH6
hoshi-hiyouga committed
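A minimal sketch of the two structures for Qwen2-VL, as referenced above (the prompt template is simplified and `example.png` is a placeholder; `AutoProcessor` and the returned keys follow the Hugging Face Qwen2-VL processor):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
image = Image.open("example.png").convert("RGB")

# Rollout side (vLLM): raw PIL images, keyed by modality.
multi_modal_data = {"image": [image]}

# Actor/critic side (HF model): processor outputs ready for the transformer.
text = "<|vision_start|><|image_pad|><|vision_end|>Describe this image."
inputs = processor(text=[text], images=[image], return_tensors="pt")
multi_modal_inputs = {
    "pixel_values": inputs["pixel_values"],
    "image_grid_thw": inputs["image_grid_thw"],
}
```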
Forgot to update params in generation.yaml (#259).
BearBiscuit committed
# Support Megatron mcore 0.11

## Description

This PR introduces official support for Megatron mcore 0.11 with the following updates:

- Upgraded Megatron to version `core_r0.11.0`
- Applied compatibility patch `patches/mcore_r0.11.patch`
- Removed legacy version support for a cleaner implementation

Special thanks to @chendong-1998 for the original Megatron upgrade from 0.4 to 0.6 (#93f6a7e).

## Compatibility Notes

The current implementation requires careful handling due to dependency conflicts:

- `megatron-core==0.11.0` requires torch>=2.6
- `vllm==0.6.3` requires torch==2.4

Installation constraints:

1. Must use vllm's torch dependency (2.4) as the baseline
2. Do NOT run `pip install -e .` in the mcore directory (it will upgrade torch to 2.6)
3. Apply the compatibility patch manually after installation

## Testing

Tested with `verl/examples/ppo_trainer/run_deepseek_megatron.sh`.

Signed-off-by: chendong-1998 <chendong136@huawei.com>
Co-authored-by: chendong-1998 <chendong136@huawei.com>
Co-authored-by: gaoziyuan <gaoziyuan.955@bytedance.com>
Co-authored-by: Sion Gao <gaoziyuan19@mails.ucas.ac.cn>
Yan Bai committed
HL committed
- 02 Mar, 2025 6 commits
Reverts volcengine/verl#314
Chi Zhang committed
Weizhe Chen committed
ZSL98 committed
Specify the IP address when calling the bind method.
Willem Jiang committed
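A minimal sketch of the point with a plain socket (the actual call site in verl is not shown in this commit): bind to a concrete IP rather than the wildcard address:

```python
import socket

# Resolve this node's address; in a cluster this should be the IP
# that peers were told to connect to.
host_ip = socket.gethostbyname(socket.gethostname())

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Binding to a specific interface (instead of "" / 0.0.0.0) ensures the
# service listens on the address that was actually advertised.
sock.bind((host_ip, 12345))
sock.listen()
```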
Guangming Sheng committed
Now APIs can be displayed. [Screenshot of the rendered API page omitted]
HL committed
- 01 Mar, 2025 2 commits
Lumeng Wu committed
Because of ongoing updates in vLLM, veRL currently cannot integrate directly with the nightly build of vLLM. The new DP feature in the nightly version can no longer be bypassed by simply adjusting the `data_parallel_size` parameter, and resolving this requires further investigation. As a temporary workaround, I recommend a customized installation of vLLM if the V1 engine is required. I have updated the relevant documentation to reflect this guidance.
ZSL98 committed
- 28 Feb, 2025 3 commits
Validation should not have shuffling.
Shawn/Yuxuan Tong committed
This is an enhancement for the single batch strategy for `val_dataloader`, making https://github.com/volcengine/verl/pull/353 more robust.
Shawn/Yuxuan Tong committed
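A minimal sketch of both points with a plain PyTorch DataLoader (verl's dataloader construction is not reproduced here): no shuffling for validation, and a single batch covering the whole validation set:

```python
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 1000
    def __getitem__(self, i):
        return i

val_dataset = ToyDataset()

# shuffle=False keeps validation deterministic across runs;
# batch_size=len(val_dataset) yields the entire set as a single batch.
val_dataloader = DataLoader(val_dataset, batch_size=len(val_dataset), shuffle=False)
```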
Willem Jiang committed
- 27 Feb, 2025 6 commits
Add TensorBoard to the Tracking backends. Users can set the environment variable `TENSORBOARD_DIR` to specify the TensorBoard log path.
Hongji Zhu committed
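A minimal sketch of how a TensorBoard backend might consume the variable (the fallback directory is an assumption; only the `TENSORBOARD_DIR` name comes from this commit):

```python
import os
from torch.utils.tensorboard import SummaryWriter

# Fall back to a default location when TENSORBOARD_DIR is unset.
log_dir = os.environ.get("TENSORBOARD_DIR", "tensorboard_log")
writer = SummaryWriter(log_dir=log_dir)

writer.add_scalar("train/reward", 0.42, global_step=1)
writer.close()
```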
Chi Zhang committed
The current training script uses the same file for both training and evaluation, which is presumably incorrect.
yaguang committed
[ckpt] Replace DataLoader with StatefulDataLoader to support resuming training with SequentialSampler (#389). Tries to resolve [issue #356](https://github.com/volcengine/verl/issues/356). As suggested in the issue discussion, I replaced the default DataLoader with StatefulDataLoader, which provides `state_dict` and `load_state_dict` methods that support resuming the iterator position for mid-epoch checkpointing.
alexchiu committed
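A minimal sketch of the resume pattern (`StatefulDataLoader` is torchdata's real API, available in torchdata >= 0.8; the checkpoint wiring around it is an assumption):

```python
from torch.utils.data import Dataset
from torchdata.stateful_dataloader import StatefulDataLoader

class RangeDataset(Dataset):
    def __len__(self):
        return 100
    def __getitem__(self, i):
        return i

dataset = RangeDataset()
loader = StatefulDataLoader(dataset, batch_size=8)

it = iter(loader)
next(it)  # consume one batch, then checkpoint mid-epoch
dl_state = loader.state_dict()  # captures the iterator position

# On resume, a freshly constructed loader picks up where it left off.
resumed = StatefulDataLoader(dataset, batch_size=8)
resumed.load_state_dict(dl_state)
for batch in resumed:
    pass  # iteration continues from the saved position
```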
Thanks @HillZhang1999. Related issue: https://github.com/volcengine/verl/issues/189

`(main_task pid=3523385) ValueError: max_num_batched_tokens (8192) is smaller than max_model_len (9216). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len.`

When `enable_chunked_prefill` is activated, the aforementioned issue is concealed. Please increase `max_num_batched_tokens` or decrease `max_model_len`.
Guangming Sheng committed
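A minimal sketch of the constraint when constructing a vLLM engine (model name and sizes are illustrative; the kwargs are forwarded to vLLM's engine arguments):

```python
from vllm import LLM

max_model_len = 9216

# Without chunked prefill, vLLM requires
# max_num_batched_tokens >= max_model_len; otherwise it raises the
# ValueError quoted above and rejects longer sequences.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    max_model_len=max_model_len,
    max_num_batched_tokens=max_model_len,  # keep >= max_model_len
    enable_chunked_prefill=False,
)
```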
Chi Zhang committed
- 26 Feb, 2025 2 commits
apis: add DataProto to the documentation page; use `copy_to_local` instead of `copy_local_path_from_hdfs` (#358)
HL committed
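A minimal sketch of the renamed helper (the import path is an assumption based on verl's utils layout; the HDFS path is a placeholder):

```python
# Hypothetical usage; import path assumed, not confirmed by this commit.
from verl.utils.fs import copy_to_local

# Returns a local copy whether the source is on HDFS or already local.
local_path = copy_to_local("hdfs://namenode/models/my-model")
```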
As titled.
Guangming Sheng committed
- 25 Feb, 2025 4 commits
See issue: https://github.com/volcengine/verl/issues/342
Mingjie Liu committed
#369. Co-authored-by: Thom <zhangyi@zhangyideMacBook-Pro.local>
_T_L_R_ committed
kriswang committed
Chi Zhang committed
- 24 Feb, 2025 4 commits
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
湛露先生 committed
BearBiscuit committed
Close #312. Add support for Ulysses SP for transformers >= 4.48. I've tested transformers 4.45.0, 4.46.0, 4.47.0, 4.48.0 and 4.49.0 with sp=2, using the following script in my local env:

```bash
#!/bin/bash
set -ex

VERSIONS=("4.45.0" "4.46.0" "4.47.0" "4.48.0" "4.49.0")

for version in "${VERSIONS[@]}"; do
    echo "Testing with Transformers version ${version}"
    echo "----------------------------------------"
    pip install "transformers==${version}"
    PYTHONPATH=./ torchrun --nproc_per_node=2 tests/model/test_transformers_ulysses.py
    echo "----------------------------------------"
    echo "Completed testing for version ${version}"
    echo ""
done
```
zhou fan committed
Fix issue [#331](https://github.com/volcengine/verl/issues/331).
BearBiscuit committed