1. 14 Mar, 2025 4 commits
  2. 13 Mar, 2025 6 commits
    • fix: remove redundant broadcast in fsdp vllm postprocess (#577) · f7e183e4
      Remove the redundant broadcast in the FSDP vLLM postprocess, since the
      vLLM output on each TP rank should be identical.
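
      A minimal sketch of the change; the function and variable names are
      illustrative, not the actual sharding-manager code:
      ```python
      def postprocess(rollout_output):
          # Previously the output was broadcast from TP rank 0 to the other
          # TP ranks (e.g. via torch.distributed.broadcast_object_list).
          # vLLM produces identical output on every TP rank, so the broadcast
          # is redundant and each rank can use its local copy directly.
          return rollout_output
      ```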
      Joel committed
    • fix: remove redundant torch.cuda.empty_cache() (#575) · 3fc3e2b7
      #556 attempted to remove unnecessary `empty_cache` calls, but this
      causes a CUDA OOM at vllm wake_up.
      ```text
        File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/fsdp_workers.py", line 481, in generate_sequences
          with self.rollout_sharding_manager:
        File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/sharding_manager/fsdp_vllm.py", line 82, in __enter__
          self.inference_engine.wake_up()
        File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py", line 1244, in wake_up
          self.llm_engine.wake_up()
        File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 1859, in wake_up
          self.model_executor.wake_up()
        File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 216, in wake_up
          self.collective_rpc("wake_up")
        File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
          answer = run_method(self.driver_worker, method, args, kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2196, in run_method
          return func(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 140, in wake_up
          allocator.wake_up()
        File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 207, in wake_up
          create_and_map(handle)
        File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 75, in create_and_map
          python_create_and_map(*allocation_handle)
      RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
      ```
      This PR removes all redundant `torch.cuda.empty_cache()` calls in the
      FSDP worker and only empties the cache before vllm wake_up and after
      vllm sleep, since vllm has its own caching memory allocator,
      [CuMemAllocator](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/device_allocator/cumem.py#L103).
      Outside the vllm scope, we should avoid emptying the cache so that
      PyTorch can use its caching allocator to speed up memory allocations.
      A sketch of the resulting pattern follows the checklist below.
      
      - [x] Cleanup FSDP worker torch.cuda.empty_cache()
      - [ ] Cleanup Megatron worker torch.cuda.empty_cache()
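
      A minimal sketch of the resulting pattern, assuming a sharding-manager
      context like the one in `fsdp_vllm.py`; the class and method bodies here
      are illustrative, not the exact verl code:
      ```python
      import torch

      class VLLMShardingManager:
          """Empty the PyTorch cache only around vLLM wake_up/sleep so that
          vLLM's CuMemAllocator can map the freed memory; everywhere else the
          PyTorch caching allocator keeps its blocks for fast reuse."""

          def __init__(self, inference_engine):
              self.inference_engine = inference_engine

          def __enter__(self):
              # Release PyTorch's cached blocks right before vLLM re-maps its
              # own pool; otherwise wake_up() can hit CUDA OOM.
              torch.cuda.empty_cache()
              self.inference_engine.wake_up()
              return self

          def __exit__(self, exc_type, exc_value, traceback):
              self.inference_engine.sleep(level=1)
              # Free whatever vLLM left cached before handing the GPU back to FSDP.
              torch.cuda.empty_cache()
      ```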
      Joel committed
    • [bugfix] PRIME filter overlong prompts & incorrect padding side & use xformers (#570) · 9bb02d27
      ### Description
      - fix filter_overlong_prompts setting in PRIME
      
      - fix incorrect padding side for Qwen in PRIME
      
      - When I use the PRIME recipe to train Qwen-series models, I get
      “*ValueError: You are attempting to perform batched generation with
      padding_side='right' this may lead to unexpected behaviour for Flash
      Attention version of Qwen2. Make sure to call tokenizer.padding_side =
      'left' before tokenizing the input.*” So I set `use_cache = False` when
      calling the model to compute the output logits.
      
      - fix CUDA error with vllm v0.6.3 
      
      - When I run PRIME, I sometimes hit *CUDA error: an illegal memory
      access was encountered*. Following
      https://github.com/vllm-project/vllm/issues/10389, I set
      `VLLM_ATTENTION_BACKEND=XFORMERS` (see the sketch below).
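
      A minimal sketch of the two workarounds; the helper name
      `compute_logits` is hypothetical and the PRIME recipe's actual code
      differs:
      ```python
      import os

      # Workaround for the vLLM v0.6.3 illegal-memory-access issue
      # (vllm-project/vllm#10389): select the xformers attention backend
      # before vLLM is initialized.
      os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

      def compute_logits(model, input_ids, attention_mask):
          """Score a right-padded batch with a Qwen2 + Flash Attention model.

          Disabling the KV cache sidesteps the padding_side='right' ValueError,
          since the cache is only needed for incremental generation.
          """
          output = model(
              input_ids=input_ids,
              attention_mask=attention_mask,
              use_cache=False,
          )
          return output.logits
      ```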
      CajZella committed
    • [bugfix] fix: generation script (#542) · 79e072f1
      # Description
      - Corrected dummy size to avoid faulty communication.
      - Fixed batch number calculation.
      - Adjusted worker group role to alleviate memory overhead.
      - Added `ray.init()` to prevent worker registration failures (see the sketch below).
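
      A minimal sketch of the `ray.init()` guard; the guard shown is
      illustrative rather than the script's exact code:
      ```python
      import ray

      # Initialize Ray before building the worker group; without this,
      # worker registration can fail when no cluster has been started yet.
      if not ray.is_initialized():
          ray.init()
      ```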
      Dai, Weinan committed
    • [rollout] feat: support sampling in validation stage (#553) · d5de9f4c
      Currently, eager mode is applied in the validation stage. However, for
      some reasoning tasks, we may need to generate n responses and average
      their scores.

      In this PR, we support non-eager sampling parameters during validation
      by specifying `val_kwargs` in the `actor_rollout_ref.rollout` config
      field, as sketched below.
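
      A minimal sketch of what such an override might look like, written
      against an OmegaConf-style config; the keys under `val_kwargs` here are
      assumptions rather than the exact schema added by this PR:
      ```python
      from omegaconf import OmegaConf

      # Illustrative rollout config: `val_kwargs` overrides the sampling
      # parameters used only during validation.
      rollout_config = OmegaConf.create({
          "temperature": 1.0,        # rollout sampling during training
          "val_kwargs": {
              "do_sample": True,     # sample instead of deterministic decoding
              "temperature": 0.6,
              "top_p": 0.95,
              "n": 4,                # generate n responses per prompt, then average scores
          },
      })

      # Plain-dict view of the validation-time sampling overrides.
      val_sampling_params = OmegaConf.to_container(rollout_config.val_kwargs, resolve=True)
      ```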
      
      
      **Future work**
      - [ ] Merge `vllm_rollout_spmd.py` and `vllm_rollout.py` into one file.
      Guangming Sheng committed
  3. 12 Mar, 2025 7 commits
  4. 11 Mar, 2025 2 commits
  5. 10 Mar, 2025 3 commits
  6. 08 Mar, 2025 2 commits
  7. 07 Mar, 2025 8 commits
  8. 06 Mar, 2025 6 commits
  9. 05 Mar, 2025 2 commits