Unverified commit cef4c2de by ZSL98, committed by GitHub

Update vLLM>=0.7 doc (#432)

Because of ongoing updates in vLLM, veRL currently cannot integrate directly with the vLLM nightly build. The new data-parallel (DP) feature in the nightly version can no longer be bypassed by simply adjusting the `data_parallel_size` parameter, and resolving this requires further investigation.

As a temporary workaround, I recommend a customized installation of vLLM when the V1 engine is required, and I have updated the relevant documentation accordingly.
parent 021db112
@@ -14,16 +14,15 @@ git clone https://github.com/volcengine/verl.git
 cd verl
 pip3 install -e .
-# Install vLLM>=0.7
-# (Option1) pip3 install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-# (Option2) pip3 install "vllm>=0.7.0"
+# Install the latest stable version of vLLM
+pip3 install vllm==0.7.3
 # Install flash-attn
 pip3 install flash-attn --no-build-isolation
 ```
-Note that if you are installing stable versions of vLLM (Option2), you need to make some tiny patches manually on vllm (/path/to/site-packages/vllm after installation) after the above steps:
+Note that if you are installing lower versions of vLLM (0.7.0, 0.7.1, 0.7.2), you need to make some tiny patches manually on vllm (/path/to/site-packages/vllm after installation) after the above steps:
 - vllm/distributed/parallel_state.py: Remove the assertion below:
@@ -40,8 +39,6 @@ if (world_size
 - vllm/executor/uniproc_executor.py: change `local_rank = rank` to `local_rank = int(os.environ["LOCAL_RANK"])`
 - vllm/model_executor/model_loader/weight_utils.py: remove the `torch.cuda.empty_cache()` in `pt_weights_iterator`
-These modifications have already been merged into the main branch of vLLM. Thus nightly vLLM or building vLLM from source do not need these patches.
 ## Features
 ### Use cuda graph
@@ -56,10 +53,19 @@ actor_rollout_ref.rollout.free_cache_engine=False \
 For a typical job like examples/ppo_trainer/run_qwen2-7b_seq_balance.sh, the rollout generation time is 115 seconds with vLLM0.6.3, while it is 85 seconds with vLLM0.7.0. By enabling the cudagraph, the generation duration is further reduced to 62 seconds.
-**Note:** Currently, if the `n` is greater than 1 in `SamplingParams` in vLLM>=0.7, there is a potential performance issue on the stability of rollout generation time (Some iterations would see generation time bursts). We are working with the vLLM team to check this issue.
-### Other features in vLLM
-1. **num_scheduler_step>1:** not supported yet (weight loading has not been aligned with `MultiStepModelRunner`)
-2. **Prefix caching:** not supported yet (vLLM sleep mode does not support prefix caching)
-3. **Chunked prefill:** supported
+**Note:** Currently, if the `n` is greater than 1 in `SamplingParams` in vLLM>=0.7, there is a potential performance issue on the stability of rollout generation time (Some iterations would see generation time bursts) using vLLM's V0 Engine.
+### Use vLLM V1 Engine
+Using the vLLM V1 engine can avoid instability issues and achieve additional performance improvements. To use the V1 engine, you can first uninstall the previously installed vLLM and then follow the steps below to install the newer version.
+```
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+git checkout eb24dc4
+sed -i "903a\ data_parallel_size = world_size // pipeline_model_parallel_size // tensor_model_parallel_size" ./vllm/distributed/parallel_state.py
+VLLM_USE_PRECOMPILED=1 pip install --editable .
+```
+Then you can enable the V1 engine by setting `export VLLM_USE_V1=1`. In some benchmark tests, the V1 engine demonstrates a 1.5x speed improvement over the vLLM V0 engine.
+The stable support of the vLLM V1 engine will come soon.
\ No newline at end of file
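To make the manual-patch note in the documentation change above easier to follow, here is a small illustrative sketch (not part of this commit) that locates the installed vLLM package and lists the three files to hand-edit when running vLLM 0.7.0, 0.7.1, or 0.7.2. It assumes vLLM is already importable in the current environment.

```python
# Illustrative helper only: print the installed vLLM files that need the
# manual patches described in the doc change above (vLLM 0.7.0-0.7.2).
import os
import vllm

pkg_dir = os.path.dirname(vllm.__file__)  # i.e. /path/to/site-packages/vllm
patches = [
    ("distributed/parallel_state.py",
     "remove the world_size assertion"),
    ("executor/uniproc_executor.py",
     'change `local_rank = rank` to `local_rank = int(os.environ["LOCAL_RANK"])`'),
    ("model_executor/model_loader/weight_utils.py",
     "remove torch.cuda.empty_cache() in pt_weights_iterator"),
]
for rel_path, todo in patches:
    print(f"{os.path.join(pkg_dir, rel_path)}  ->  {todo}")
```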
@@ -103,6 +103,7 @@ class vLLMRollout(BaseRollout):
             disable_log_stats=config.disable_log_stats,
             max_num_batched_tokens=max_num_batched_tokens,
             enable_chunked_prefill=config.enable_chunked_prefill,
+            enable_prefix_caching=True,
         )
         # Offload vllm model to reduce peak memory usage
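The rollout change above turns prefix caching on unconditionally when the inference engine is constructed. As a minimal standalone sketch (assumptions: vLLM >= 0.7 is installed and the model name below is only a placeholder), the same engine options can be exercised directly through vLLM's `LLM` class:

```python
# Standalone sketch of the engine options touched by this diff (not verl code).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",   # placeholder; use any locally available checkpoint
    enable_chunked_prefill=True,      # stays configurable in verl via config.enable_chunked_prefill
    enable_prefix_caching=True,       # newly enabled unconditionally by this commit
    disable_log_stats=True,
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```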