fix: remove redundant torch.cuda.empty_cache() (#575)
#556 made an effort to remove unnecessary `empty_cache()` calls, but it causes a CUDA OOM at vLLM `wake_up()`:
```text
File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/fsdp_workers.py", line 481, in generate_sequences
with self.rollout_sharding_manager:
File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/sharding_manager/fsdp_vllm.py", line 82, in __enter__
self.inference_engine.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py", line 1244, in wake_up
self.llm_engine.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 1859, in wake_up
self.model_executor.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 216, in wake_up
self.collective_rpc("wake_up")
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2196, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 140, in wake_up
allocator.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 207, in wake_up
create_and_map(handle)
File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 75, in create_and_map
python_create_and_map(*allocation_handle)
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
```
This PR removes all redundant `torch.cuda.empty_cache()` calls in the FSDP worker and only empties the cache right before vLLM `wake_up()` and right after vLLM `sleep()`, since vLLM has its own caching memory allocator,
[CuMemAllocator](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/device_allocator/cumem.py#L103).
Outside the vLLM scope, we should avoid emptying the cache so that PyTorch can keep reusing its caching allocator and speed up memory allocations.
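Roughly, the intended placement is sketched below. This is a minimal, illustrative sketch of the sharding-manager context manager, assuming it wraps a vLLM `LLM` instance with the `wake_up()` / `sleep()` API; the class and attribute names here are simplified, not the exact code in `verl/workers/sharding_manager/fsdp_vllm.py`.

```python
import torch


class FSDPVLLMShardingManager:
    """Illustrative sketch: empty the PyTorch cache only around vLLM
    sleep/wake transitions, nowhere else in the FSDP worker."""

    def __init__(self, inference_engine):
        # Assumed to be a vLLM LLM instance supporting wake_up()/sleep().
        self.inference_engine = inference_engine

    def __enter__(self):
        # Release PyTorch's cached blocks right before vLLM re-maps the
        # memory managed by its CuMemAllocator, so wake_up() does not OOM.
        torch.cuda.empty_cache()
        self.inference_engine.wake_up()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Put vLLM to sleep, then empty the cache once so the released
        # device memory is available to the FSDP training workers.
        self.inference_engine.sleep(level=1)
        torch.cuda.empty_cache()
```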
- [x] Clean up `torch.cuda.empty_cache()` in the FSDP worker
- [ ] Clean up `torch.cuda.empty_cache()` in the Megatron worker