Unverified commit cef4c2de by ZSL98, committed by GitHub

Update vLLM>=0.7 doc (#432)

Because of ongoing updates in vLLM, veRL currently cannot integrate directly with the vLLM nightly build. The new data-parallel (DP) feature in the nightly version can no longer be bypassed by simply adjusting the `data_parallel_size` parameter, and resolving this requires further investigation.

As a temporary workaround, I recommend a customized installation of vLLM when the V1 engine is required, and I have updated the relevant documentation accordingly.
parent 021db112
@@ -14,16 +14,15 @@ git clone https://github.com/volcengine/verl.git
 cd verl
 pip3 install -e .
-# Install vLLM>=0.7
-# (Option1) pip3 install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-# (Option2) pip3 install "vllm>=0.7.0"
+# Install the latest stable version of vLLM
+pip3 install vllm==0.7.3
 # Install flash-attn
 pip3 install flash-attn --no-build-isolation
 ```
-Note that if you are installing stable versions of vLLM (Option2), you need to make some tiny patches manually on vllm (/path/to/site-packages/vllm after installation) after the above steps:
+Note that if you are installing lower versions of vLLM (0.7.0, 0.7.1, 0.7.2), you need to make some tiny patches manually on vllm (/path/to/site-packages/vllm after installation) after the above steps:
 - vllm/distributed/parallel_state.py: Remove the assertion below:
@@ -40,8 +39,6 @@ if (world_size
 - vllm/executor/uniproc_executor.py: change `local_rank = rank` to `local_rank = int(os.environ["LOCAL_RANK"])`
 - vllm/model_executor/model_loader/weight_utils.py: remove the `torch.cuda.empty_cache()` in `pt_weights_iterator`
-These modifications have already been merged into the main branch of vLLM. Thus nightly vLLM or building vLLM from source do not need these patches.
 ## Features
 ### Use cuda graph
@@ -56,10 +53,19 @@ actor_rollout_ref.rollout.free_cache_engine=False \
 For a typical job like examples/ppo_trainer/run_qwen2-7b_seq_balance.sh, the rollout generation time is 115 seconds with vLLM0.6.3, while it is 85 seconds with vLLM0.7.0. By enabling the cudagraph, the generation duration is further reduced to 62 seconds.
-**Note:** Currently, if the `n` is greater than 1 in `SamplingParams` in vLLM>=0.7, there is a potential performance issue on the stability of rollout generation time (Some iterations would see generation time bursts). We are working with the vLLM team to check this issue.
-### Other features in vLLM
-1. **num_scheduler_step>1:** not supported yet (weight loading has not been aligned with `MultiStepModelRunner`)
-2. **Prefix caching:** not supported yet (vLLM sleep mode does not support prefix caching)
-3. **Chunked prefill:** supported
+**Note:** Currently, if the `n` is greater than 1 in `SamplingParams` in vLLM>=0.7, there is a potential performance issue on the stability of rollout generation time (Some iterations would see generation time bursts) using vLLM's V0 Engine.
+### Use vLLM V1 Engine
+Using the vLLM V1 engine can avoid instability issues and achieve additional performance improvements. To use the V1 engine, you can first uninstall the previously installed vLLM and then follow the steps below to install the newer version.
+```
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout eb24dc4
+sed -i "903a\ data_parallel_size = world_size // pipeline_model_parallel_size // tensor_model_parallel_size" ./vllm/distributed/parallel_state.py
+VLLM_USE_PRECOMPILED=1 pip install --editable .
+```
+Then you can enable the V1 engine by setting `export VLLM_USE_V1=1`. In some benchmark tests, the V1 engine demonstrates a 1.5x speed improvement over the vLLM V0 engine.
+The stable support of the vLLM V1 engine will come soon.
\ No newline at end of file
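To make the manual-patch note in the documentation change above easier to follow, here is a small illustrative sketch (not part of this commit) that locates the installed vLLM package and lists the three files to hand-edit when running vLLM 0.7.0, 0.7.1, or 0.7.2. It assumes vLLM is already importable in the current environment.

```python
# Illustrative helper only: print the installed vLLM files that need the
# manual patches described in the doc change above (vLLM 0.7.0-0.7.2).
import os
import vllm

pkg_dir = os.path.dirname(vllm.__file__)  # i.e. /path/to/site-packages/vllm
patches = [
    ("distributed/parallel_state.py",
     "remove the world_size assertion"),
    ("executor/uniproc_executor.py",
     'change `local_rank = rank` to `local_rank = int(os.environ["LOCAL_RANK"])`'),
    ("model_executor/model_loader/weight_utils.py",
     "remove torch.cuda.empty_cache() in pt_weights_iterator"),
]
for rel_path, todo in patches:
    print(f"{os.path.join(pkg_dir, rel_path)}  ->  {todo}")
```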
@@ -103,6 +103,7 @@ class vLLMRollout(BaseRollout):
             disable_log_stats=config.disable_log_stats,
             max_num_batched_tokens=max_num_batched_tokens,
             enable_chunked_prefill=config.enable_chunked_prefill,
+            enable_prefix_caching=True,
         )
         # Offload vllm model to reduce peak memory usage
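The rollout change above turns prefix caching on unconditionally when the inference engine is constructed. As a minimal standalone sketch (assumptions: vLLM >= 0.7 is installed and the model name below is only a placeholder), the same engine options can be exercised directly through vLLM's `LLM` class:

```python
# Standalone sketch of the engine options touched by this diff (not verl code).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # placeholder; use any locally available checkpoint
    enable_chunked_prefill=True,      # stays configurable in verl via config.enable_chunked_prefill
    enable_prefix_caching=True,       # newly enabled unconditionally by this commit
    disable_log_stats=True,
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```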