- 14 Mar, 2025 4 commits
-
-
## Summary
Provides a config option to turn off the `torch.compile` used in `dp_actor.py`.

## Usage
Add the following line to the driver or CLI scripts to turn off `torch.compile`; otherwise, `torch.compile` is used by default.
```bash
+actor_rollout_ref.actor.use_torch_compile=False
```

## Related Issues
#354 #245

---------

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Hongpeng Guo committed -
Since it is a `DataProto` instance, calling `to(device)` already moves `data.batch` to the specified device. https://github.com/volcengine/verl/blob/329dcfe1dd60f2d736ee55914e2a49e1887718eb/verl/protocol.py#L324-L336
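A stand-in sketch (not verl's actual class, which lives at the link above) of why a second explicit move of `data.batch` is redundant:

```python
import torch

class MiniDataProto:
    """Mimics the linked DataProto.to: the method moves self.batch itself."""

    def __init__(self, batch: dict):
        self.batch = batch

    def to(self, device):
        # forwards the device move to the underlying batch tensors
        self.batch = {k: v.to(device) for k, v in self.batch.items()}
        return self

data = MiniDataProto({"input_ids": torch.zeros(2, 4, dtype=torch.long)})
data = data.to("cpu")  # batch already moved; no extra data.batch.to(...) needed
```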
Lumeng Wu committed -
#354
Joel committed -
Follow-up to https://github.com/volcengine/verl/pull/309
Chenhui Zhang committed
-
- 13 Mar, 2025 6 commits
-
-
none0663 committed
-
Remove redundant broadcast in fsdp vllm postprocess since vllm output in each tp rank should be identical.
Joel committed -
#556 took effort to remove unnecessary `empty_cache` calls, but caused CUDA OOM at vllm wake_up:
```text
  File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/fsdp_workers.py", line 481, in generate_sequences
    with self.rollout_sharding_manager:
  File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/sharding_manager/fsdp_vllm.py", line 82, in __enter__
    self.inference_engine.wake_up()
  File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py", line 1244, in wake_up
    self.llm_engine.wake_up()
  File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 1859, in wake_up
    self.model_executor.wake_up()
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 216, in wake_up
    self.collective_rpc("wake_up")
  File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2196, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 140, in wake_up
    allocator.wake_up()
  File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 207, in wake_up
    create_and_map(handle)
  File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 75, in create_and_map
    python_create_and_map(*allocation_handle)
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
```

This PR removes all redundant `torch.cuda.empty_cache()` calls in the FSDP worker and only empties the cache before vllm wake_up and after vllm sleep, since vllm has its own caching memory allocator, [CuMemAllocator](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/device_allocator/cumem.py#L103). Outside the vllm scope, we should avoid emptying the cache so PyTorch can reuse cached memory and speed up allocations (a minimal sketch of the retained pattern follows the checklist).

- [x] Clean up FSDP worker `torch.cuda.empty_cache()`
- [ ] Clean up Megatron worker `torch.cuda.empty_cache()`
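A minimal sketch of the retained pattern, assuming the `wake_up`/`sleep` API of vllm >= 0.7 named in the traceback above (the real sharding manager in `fsdp_vllm.py` differs in detail):

```python
import torch

class VLLMBoundary:
    """Empty the torch CUDA cache only at the vllm enter/exit boundary."""

    def __init__(self, inference_engine):
        self.inference_engine = inference_engine  # a vllm LLM instance

    def __enter__(self):
        # Hand cached blocks back to CUDA so vllm's CuMemAllocator
        # can map its memory pool without hitting OOM.
        torch.cuda.empty_cache()
        self.inference_engine.wake_up()

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.inference_engine.sleep()
        # Reclaim memory freed by vllm before FSDP training resumes;
        # everywhere else, leave PyTorch's caching allocator alone.
        torch.cuda.empty_cache()
```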
Joel committed -
### Description
- Fix the `filter_overlong_prompts` setting in PRIME.
- Fix the incorrect padding side for Qwen in PRIME: when using the PRIME recipe to train Qwen-series models, generation raised "*ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Qwen2. Make sure to call tokenizer.padding_side = 'left' before tokenizing the input.*" So `use_cache = False` is set when calling the model to calculate output logits.
- Fix a CUDA error with vllm v0.6.3: running PRIME could raise "*CUDA error: an illegal memory access was encountered*". Following https://github.com/vllm-project/vllm/issues/10389, `VLLM_ATTENTION_BACKEND=XFORMERS` is set (sketched below).
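Workaround sketch: force the xformers attention backend before vllm initializes (this mirrors the environment variable named in the issue above; setting it in the shell works equally well):

```python
import os

# Must be set before vllm is imported/initialized for the backend to apply.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"
```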
CajZella committed -
# Description
- Corrected dummy size to avoid faulty communication.
- Fixed batch number calculation.
- Adjusted worker group role to alleviate memory overhead.
- Added ray.init() to prevent failing to register workers.
Dai, Weinan committed -
Currently, eager mode is applied in the validation stage. However, in some reasoning tasks we may need to generate n responses and average the scores. This PR supports non-eager sampling parameters during validation, specified via the `val_kwargs` field in the `actor_rollout_ref.rollout` config (a sketch of the intended effect follows below).

**Future work**
- [ ] Merge `vllm_rollout_spmd.py` and `vllm_rollout.py` into one file.
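Illustrative only: the idea behind `val_kwargs` is to override the eager/greedy defaults at validation time. The key names below are assumptions; check the rollout config for the exact supported fields.

```python
# Default validation sampling vs. overrides supplied through val_kwargs.
default_sampling = {"do_sample": False, "temperature": 0.0, "n": 1}
val_kwargs = {"do_sample": True, "temperature": 0.7, "n": 4}

# Overrides win, so validation now samples n generations per prompt.
val_sampling = {**default_sampling, **val_kwargs}
print(val_sampling)
```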
Guangming Sheng committed
-
- 12 Mar, 2025 7 commits
-
-
Zheng-Yuxiang committed
-
BearBiscuit committed
-
As we're moving to vllm>=0.7.3, we should remove `verl/third_party` completely in the future.
Joel committed -
# Description
https://github.com/volcengine/verl/issues/287, https://github.com/volcengine/verl/issues/295. This PR introduces support for [Math-Verify](https://github.com/huggingface/Math-Verify) as a new rule-based reward scorer, significantly improving evaluation accuracy.

# Key changes
- Added `math-verify` to the installation dependencies.
- Introduced `reward_score/math_verify.py` and updated `reward_score/__init__.py`.

# Test
Comparison between the existing scorer in `math.py` and the newly added `math_verify.py`, using Qwen2.5-Math-7B-Instruct:
```
# Use scorer in math.py (original)
{'val/test_score/DigitalLearningGmbH/MATH-lighteval': 0.803}

# Use scorer in math_verify.py (newly added)
{'val/test_score/DigitalLearningGmbH/MATH-lighteval': 0.8338}
```
Test scripts:
```bash
set -x

# Data Process
python examples/data_preprocess/math_dataset.py --local_dir /workspace/datasets/math

# Evaluation
export CUDA_VISIBLE_DEVICES=4,5,6,7
export VLLM_ATTENTION_BACKEND=XFORMERS

math_train_path=/workspace/datasets/math/train.parquet
math_test_path=/workspace/datasets/math/test.parquet

python3 -m verl.trainer.main_ppo \
    data.train_files="$math_train_path" \
    data.val_files="$math_test_path" \
    data.max_prompt_length=2048 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-Math-7B-Instruct \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=1 \
    actor_rollout_ref.rollout.temperature=0 \
    trainer.logger=['console'] \
    trainer.project_name='test-math-verify' \
    trainer.experiment_name='test-math-verify' \
    +trainer.val_before_train=True \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.total_epochs=0 \
    data.train_batch_size=1024 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    algorithm.adv_estimator=grpo $@
```
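For reference, a quick sketch of Math-Verify's scoring primitives, based on the Math-Verify README (treat the exact signatures as an assumption; the new `reward_score/math_verify.py` wraps them):

```python
from math_verify import parse, verify

# Parse the gold answer and a model answer, then check equivalence.
gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${1,2,3,4}$")
print(verify(gold, answer))  # True: mathematically equivalent forms match
```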
Yuyang Ding committed -
Urgently update megatron core_r0.11.0 documentation.
Blue Space committed -
Chi Zhang committed
-
This PR removes several unnecessary `empty_cache` to improve efficiency. Credit to @PeterSH6
Shawn/Yuxuan Tong committed
-
- 11 Mar, 2025 2 commits
-
-
Guangming Sheng committed
-
1. Add [PRIME](https://arxiv.org/abs/2502.01456) to README.md.
2. Slightly change the example script to align with the paper.
Zefan Wang committed
-
- 10 Mar, 2025 3 commits
-
-
Refactor and merge the PRIME algorithm (https://github.com/PRIME-RL/PRIME) into verl/main.

Breaking change: `trainer.fsdp_config.min_num_params` is now moved to `trainer.fsdp_config.wrap_policy.min_num_params`.
Zefan Wang committed -
[bugfix] Fix position embedding processing for Qwen2.5-VL

In the `RLHFDataset.__getitem__` method, a bug was identified in how multimodal position IDs (3D in Qwen2.5-VL) are determined. Previously, the code checked for `self.image_key in row_dict` to decide whether to use multimodal position IDs. However, since `self.image_key` is popped from `row_dict` during image token expansion, this check incorrectly fails for subsequent operations. This causes the VL model to use incorrect position IDs, resulting in significant performance degradation.

<img width="349" alt="image" src="https://github.com/user-attachments/assets/79790bbf-239e-4667-a2c5-d63d91d63165" />

The fix introduces an explicit `is_multi_modal` flag to properly track multimodal content throughout the processing pipeline.

Co-authored-by: songyifan <songyifan3@xiaomi.com>
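A sketch of the fix's shape (names are illustrative, not verl's exact code): the multimodality flag is captured *before* the image key is popped during token expansion.

```python
def select_position_id_mode(row_dict: dict, image_key: str = "images") -> str:
    is_multi_modal = image_key in row_dict   # flag recorded up front
    row_dict.pop(image_key, None)            # pop happens in image-token expansion
    # The later check uses the flag, not `image_key in row_dict`,
    # which would now incorrectly evaluate to False.
    return "3d" if is_multi_modal else "1d"

assert select_position_id_mode({"images": [object()], "prompt": "hi"}) == "3d"
assert select_position_id_mode({"prompt": "hi"}) == "1d"
```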
Yifan Song committed -
Current bugs when enabling HSDP:

- **Incorrect division in batch sizes**
  - `ppo_micro_batch`, `ppo_minibatch`, etc. should be divided by `self.device_mesh.size()` instead of `self.device_mesh.shape[0]`.
- **Improper weight initialization in `get_init_weight_context_manager`**
  - The `get_init_weight_context_manager` function must initialize empty weights only on `local_rank == 0` within every fsdp mesh.
  - When `sync_module_states=True`, PyTorch's FSDP first broadcasts parameters within the fsdp process group and then within the ddp process group. If weights are not initialized correctly on `local_rank == 0` of each fsdp mesh, the synchronization may fail or produce incorrect results. https://github.com/pytorch/pytorch/blob/3f069e7679588d5ee4b1d5b2492ca0e20f9320b5/torch/distributed/fsdp/_init_utils.py#L614-L621
  - Ensure initialization occurs only when `self.device_mesh.get_coordinate()[-1] == 0`, which corresponds to `local_rank == 0` within each fsdp mesh.

An arithmetic sketch of the batch-size fix follows this list.
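Illustrative arithmetic for the division fix (mesh shape chosen for the example): with an HSDP mesh of shape `(ddp=2, fsdp=4)`, batch sizes must be divided by the total mesh size, which is what `device_mesh.size()` returns, not by `device_mesh.shape[0]`.

```python
mesh_shape = (2, 4)                       # (ddp, fsdp)
mesh_size = mesh_shape[0] * mesh_shape[1] # what device_mesh.size() returns: 8

ppo_mini_batch_size = 256
correct_per_rank = ppo_mini_batch_size // mesh_size    # 32
buggy_per_rank = ppo_mini_batch_size // mesh_shape[0]  # 128, i.e. 4x too large
assert (correct_per_rank, buggy_per_rank) == (32, 128)
```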
zhr2001 committed
-
- 08 Mar, 2025 2 commits
-
-
Haosheng Zou (邹昊晟) committed
-
Lumeng Wu committed
-
- 07 Mar, 2025 8 commits
-
-
- [x] Add concurrency to workflows to cancel previous workflow runs when a new commit is pushed to the same branch.
- [ ] Cancel all workflows/jobs from the same commit if any fails? (Not sure whether we really need it.)

Note: we leave out `secrets_scan.yml` and `scorecard.yml` to avoid any possible leakage or security risk; these workflows also cost little to run.
Shawn/Yuxuan Tong committed -
Because searching for an appropriate `simplify` algorithm can make `sympy.simplify` run indefinitely, and the `ProcessPool` may get stuck under excessive concurrency, the timeout mechanism in `verl/verl/workers/reward_manager/prime.py` cannot catch the timeout. To address this, a timeout detection mechanism is added around `sympy.simplify` in `verl/verl/utils/reward_score/prime_math/__init__.py`.
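A minimal sketch of a per-call timeout around `sympy.simplify`, using a POSIX SIGALRM guard (an assumption for illustration; the actual mechanism in `prime_math/__init__.py` may differ):

```python
import signal
import sympy

def simplify_with_timeout(expr, seconds: int = 5):
    def _on_timeout(signum, frame):
        raise TimeoutError("sympy.simplify timed out")

    previous = signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(seconds)                 # arm the watchdog
    try:
        return sympy.simplify(expr)
    finally:
        signal.alarm(0)                   # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)

x = sympy.Symbol("x")
print(simplify_with_timeout((x**2 - 1) / (x - 1)))  # x + 1
```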
Yuchen Zhang committed -
# Background
In RLHFDataset, we filter out prompts that are too long. This requires applying apply_chat_template to the whole dataset, which is not scalable when the dataset is large. https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L132

Instead of performing filtering online, we probably want to move this process offline and add an assertion to avoid truncation, or simply perform truncation. Reference: #502

# Key Changes
- Add an option `data.filter_overlong_prompts=True` to enable the above data filtering. The default value is False, but we enable it for all the example scripts.
- Add an option `data.truncation` to truncate the input_ids or prompt length if they exceed max_prompt_length. The default is 'error', which does not allow max_prompt_length to be exceeded; users should increase max_prompt_length if the error is thrown. You can also set it to 'left' or 'right' (see the truncation sketch below).

### Suggestion for large-scale datasets
For large-scale datasets, filtering overlong prompts could be time-consuming. You should set `data.filter_overlong_prompts=False` and set `data.truncation='left'`. Also note that you should increase `data.max_prompt_length` to avoid over-truncation of the prompts.
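Illustrative behaviour of the `data.truncation` option on a token list (a sketch, not the dataset's exact code):

```python
def truncate(input_ids: list, max_prompt_length: int, truncation: str = "error"):
    if len(input_ids) <= max_prompt_length:
        return input_ids
    if truncation == "left":                  # keep the trailing tokens
        return input_ids[-max_prompt_length:]
    if truncation == "right":                 # keep the leading tokens
        return input_ids[:max_prompt_length]
    raise ValueError(
        f"prompt length {len(input_ids)} exceeds max_prompt_length {max_prompt_length}"
    )

assert truncate(list(range(10)), 4, "left") == [6, 7, 8, 9]
assert truncate(list(range(10)), 4, "right") == [0, 1, 2, 3]
```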
Guangming Sheng committed -
zhou fan committed
-
close #503
Joel committed -
Verl's megatron core_r0.11.0 backend successfully tested with 3D parallelism, with multiple bugs fixed (#495). This PR combines multiple modifications.

# QWen2.5 checkpoint saver bug fix
Thanks to the efforts @uygnef contributed in #368, we use the new saver for model loading and saving with 3D parallelism support.

# Megatron backend 3D-parallelism test benches
We modify the scripts in `examples/ppo_trainer` and `tests/e2e`, as well as the CI workflows; all tested.

# Bug fixes for 3D parallelism
Including configuration bugs as well as module packing. The original TP VocabParallelEntropy can lead to CUDA OOM, so we refactor the implementation with `torch.bmm`.

# Full migration to Megatron Core
Now we only use Megatron Core in verl, fully getting rid of calls to other components. If they are needed, please integrate them into `utils/megatron`.

---------

Co-authored-by: uygnef <admin@fengyu.org>
Blue Space committed -
Willem Jiang committed
-
Joel committed
-
- 06 Mar, 2025 6 commits
-
-
This PR solves the 2 following problems.

1. Last step skipped

`self.global_steps += 1` before the `if self.global_steps >= self.total_training_steps` check makes the last step skipped. We start from step 1, and we expect `self.total_training_steps` steps in total.
https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L999-L1001
When `self.global_steps == self.total_training_steps - 1`:
* we have only executed `self.total_training_steps - 1` steps
* `self.global_steps` is updated to `self.total_training_steps`
* `self.global_steps >= self.total_training_steps` is satisfied, and the training ends.

Therefore, we should put `self.global_steps += 1` at the end (see the sketch after this list).

2. Redundant validation and logging

If `self.total_training_steps % self.config.trainer.test_freq == 0`:
* `self._validate()` will be executed twice:
  1. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L984
  2. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L1005
* logging will also be executed twice:
  1. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L985 and https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L997
  2. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L1007
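A minimal sketch of the corrected ordering (toy loop, not the trainer's actual code): check termination before incrementing, so exactly `total_training_steps` steps run and the final validate/log happens once.

```python
total_training_steps = 3
global_steps = 1        # training starts from step 1
executed = 0

while True:
    executed += 1       # ... run one training step here ...
    if global_steps >= total_training_steps:
        break           # final validation/logging happens exactly once here
    global_steps += 1   # increment moved to the end of the iteration

assert executed == total_training_steps
```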
Lumeng Wu committed -
- Add allgather method to DataProto
- Add tests
- Replace existing raw allgather with this function
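A sketch of the collective's shape using `torch.distributed.all_gather_object` (the real DataProto helper concatenates tensordicts; this stand-in is an assumption, shown as a definition only and requiring an initialized process group to call):

```python
import torch.distributed as dist

def allgather_objects(local_obj, group=None):
    """Gather one picklable object from every rank into a list."""
    world_size = dist.get_world_size(group=group)
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_obj, group=group)
    return gathered
```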
Chi Zhang committed -
Yusheng (Ethan) Su committed
-
In this PR, a `val_generations_to_log_to_swanlab` parameter has been added. When this parameter is set to 1, logging the generated text from eval to SwanLab is supported. @hiyouga

---

This pull request introduces logging of validation generations to Swanlab in addition to Wandb. The changes include updates to several configuration files and the addition of a new logging method in `ray_trainer.py`. Key changes include:

### Configuration Updates:
* Added the `val_generations_to_log_to_swanlab` parameter to the `trainer` section in the following configuration files:
  * `examples/split_placement/config/ppo_trainer_split.yaml`
  * `verl/trainer/config/ppo_megatron_trainer.yaml`
  * `verl/trainer/config/ppo_trainer.yaml`

### Code Updates:
* Added a new method `_maybe_log_val_generations_to_swanlab` to log validation samples to Swanlab in `verl/trainer/ppo/ray_trainer.py`.
* Updated the `_validate` method to call the new Swanlab logging method in `verl/trainer/ppo/ray_trainer.py`.
Ze-Yi LIN committed -
### What does this PR do?
In the `naive` mode, passing `extra_info` to the reward function is supported (https://github.com/volcengine/verl/pull/266), but support in the `prime` mode is missing. This causes reward functions that use `extra_info` to produce incorrect results in the `prime` mode. This commit fixes the issue.

### Who can review?
@PeterSH6 @vermouth1992 @hiyouga or other people who have the authority?
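A sketch of a custom reward function consuming `extra_info` (the four-argument convention follows PR #266's naive-mode support; the `bonus` field is a hypothetical example):

```python
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # Base score from exact match; extra_info can carry per-sample metadata.
    bonus = 0.0 if extra_info is None else float(extra_info.get("bonus", 0.0))
    return float(solution_str.strip() == ground_truth) + bonus
```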
nomadlx committed -
Set timeouts in CI to avoid infinite hangs. Closes #468.
Chi Zhang committed
-
- 05 Mar, 2025 2 commits
-
-
Chi Zhang committed
-
This pull request updates `docs/examples/config.rst` to enhance the documentation for the `Trainer` configuration. The most important changes expand the documented support for various logging platforms.

Documentation updates:
* `docs/examples/config.rst`: Updated the descriptions for `trainer.project_name`, `trainer.experiment_name`, and `trainer.logger` to include support for additional logging platforms such as swanlab, mlflow, and tensorboard.
Ze-Yi LIN committed
-