05 Feb, 2025 4 commits
04 Feb, 2025 3 commits
03 Feb, 2025 4 commits
02 Feb, 2025 1 commit
01 Feb, 2025 2 commits
31 Jan, 2025 4 commits
30 Jan, 2025 8 commits
29 Jan, 2025 3 commits
28 Jan, 2025 1 commit
27 Jan, 2025 10 commits
    • [perf] docs: fix typo · 54603cbd
      HL committed
    • docs: add news for doubao-1.5-pro · b2c6ff7a
      HL committed
    • Update README.md (#146) · 12b0b59e
      - Add link to performance tuning
      Chi Zhang committed
    • [misc] fix: gradient accumulation in seq balance and modify default vllm log level (#141) · 695bdbb0
      - The previous gradient accumulation value was computed from micro_batch_size, which is wrong when using dynamic_bsz.
      - Fix the CI script to avoid overlooking this issue.
      - Change the vLLM stats log default (`disable_log_stats`) to True to disable the log.
      - We now check that `self.config.actor.ppo_mini_batch_size % self.config.actor.ppo_micro_batch_size_per_gpu == 0` holds after normalization in fsdp_workers instead of in dp_actor and dp_critic (see the sketch below).
      Guangming Sheng committed
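      A minimal sketch of the check described in the commit above (the function and variable names are hypothetical, not the repo's actual code), assuming the global mini batch size is first normalized by the DP world size:
      ```python
      def grad_accum_steps(ppo_mini_batch_size: int,
                           ppo_micro_batch_size_per_gpu: int,
                           dp_world_size: int) -> int:
          # Normalize the global (single-controller) mini batch size per DP rank.
          assert ppo_mini_batch_size % dp_world_size == 0
          local_mini_bsz = ppo_mini_batch_size // dp_world_size
          # Check divisibility after normalization (as the commit does in
          # fsdp_workers) instead of deriving accumulation from the raw
          # micro_batch_size, which breaks when dynamic_bsz is used.
          assert local_mini_bsz % ppo_micro_batch_size_per_gpu == 0
          return local_mini_bsz // ppo_micro_batch_size_per_gpu

      # e.g. global mini batch 256, 4 per GPU, 8 DP ranks -> 8 accumulation steps
      print(grad_accum_steps(256, 4, 8))
      ```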
    • [SFT] Support context parallelism for SFT (#132) · 077173f2
      # Add Sequence Parallelism and Padding Removal to SFT Trainer
      
      This PR adds sequence parallelism (SP) and padding removal optimizations
      to the SFT trainer, which can help improve training efficiency for large
      language models.
      
      ## Key Changes
      
      ### Core Features
      1. **Sequence Parallelism**: Added support for sequence parallelism
      through the Ulysses framework
         - Configurable via `ulysses_sequence_parallel_size` parameter
         - Properly handles data distribution across SP ranks
         - Maintains consistent loss computation across distributed setup
      
      2. **Padding Removal**: Added support for efficient handling of
      variable-length sequences
         - Enabled via the `use_remove_padding` flag (requires SP to be enabled; see the sketch after this list)
         - Uses flash-attention's padding removal utilities
         - Handles proper re-padding and loss computation
      
      3. **Training Improvements**:
         - Added label smoothing support to loss computation
         - Added progress bar with epoch information
         - Added RoPE scaling configuration support
         - Improved error messages for batch size validation
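
      The interaction between the two flags above can be summarized with a small validation sketch (the helper is hypothetical, and "SP enabled" is assumed to mean a parallel size greater than 1):
      ```python
      def validate_sft_flags(ulysses_sequence_parallel_size: int,
                             use_remove_padding: bool) -> None:
          # Padding removal is only supported together with sequence parallelism.
          if use_remove_padding and ulysses_sequence_parallel_size <= 1:
              raise ValueError(
                  "use_remove_padding requires ulysses_sequence_parallel_size > 1")

      validate_sft_flags(ulysses_sequence_parallel_size=2, use_remove_padding=True)  # ok
      ```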
      
      ### Testing
      - Added comprehensive test suite (`test_trainer.py`) to verify:
        - Forward pass consistency between original and SP+rmpad implementations
        - Loss computation correctness across distributed setup
        - Proper handling of micro-batches
      
      ### Example Usage
      Added example script `examples/sft/gsm8k/run_qwen_05_sp2.sh`
      demonstrating how to use the new features with the Qwen2.5-0.5B model.
      
      ## Implementation Details
      - Uses device mesh for proper distributed training setup
      - Handles data distribution, ensuring the same sequences within SP groups but different sequences across DP groups (see the sketch after this list)
      - Carefully manages backward pass timing with gradient checkpointing
      - Maintains compatibility with existing FSDP features
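
      To make the data distribution point concrete, a minimal sketch (the helper below is hypothetical, not the repo's code): ranks in the same SP group map to the same data shard, while different DP groups map to different shards.
      ```python
      def dp_shard_index(global_rank: int, sp_size: int) -> int:
          # With ranks laid out as (dp, sp) on the device mesh, the sp_size ranks
          # in one SP group share a DP index and therefore read the same shard.
          return global_rank // sp_size

      world_size, sp_size = 8, 2
      for rank in range(world_size):
          print(f"rank {rank} -> data shard {dp_shard_index(rank, sp_size)}")
      # ranks 0,1 -> shard 0; ranks 2,3 -> shard 1; ... (4 DP groups x 2 SP ranks)
      ```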
      
      ## Testing Instructions
      1. Run the example script with sequence parallelism:
      ```bash
      bash examples/sft/gsm8k/run_qwen_05_sp2.sh <nproc_per_node> <save_path>
      ```
      
      2. Run the test suite:
      ```bash
      bash tests/sft/run_sft_sp_loss_match.sh
      ```
      
      
      ^^ This PR description was generated by [OpenHands](https://github.com/All-Hands-AI/OpenHands)
      
      ---------
      
      Co-authored-by: Jiayi Pan <i@jiayipan.me>
      Co-authored-by: openhands <openhands@all-hands.dev>
      Xingyao Wang committed
    • [VLLM] Set max_num_batched_tokens for vllm rollout (#140) · c99df03f
      We set `max_num_batched_tokens` in the `.rollout` config, but it wasn't actually being passed to vLLM, causing potential under-utilization of the GPUs.
      
      This PR:
      
      - properly pass `max_num_batched_tokens` from the config to vLLM (see the sketch below)
      - set `disable_log_stats` to False so that vLLM performance information is displayed (to spot issues)
      Xingyao Wang committed
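      A minimal sketch of the wiring this commit describes (the surrounding function and config object are hypothetical, and the exact set of engine arguments depends on the installed vLLM version):
      ```python
      from vllm import LLM

      def build_rollout_engine(model_path, rollout_config):
          # Forward the rollout settings to the vLLM engine instead of dropping them.
          return LLM(
              model=model_path,
              max_num_batched_tokens=rollout_config.max_num_batched_tokens,
              disable_log_stats=False,  # keep vLLM performance stats visible
          )
      ```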
    • [BREAKING][misc] feat: change micro_batch_size to micro_batch_size_per_gpu (#136) · f2a76acd
      ## Summary
      
      This PR renames all `micro_batch_size` parameters to `micro_batch_size_per_gpu`.
      
      **The core logic of setting batch sizes:**
      - **All algorithmic metrics** (train batch size, ppo mini batch size) are global (from the single-controller perspective) and are normalized in each Worker.
      - **All performance-related parameters** (micro batch size, max token length in dynamic batch size) are local and represent per-GPU (i.e., per-Worker) data sizes; see the sketch below.
      
      ## Main Changes
      
      1. Update the scripts and configs, and remove the normalization for micro_bsz
      2. Fix CI for SFT
      Guangming Sheng committed
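      A minimal sketch of the convention described above (names are hypothetical): algorithmic batch sizes are global and get normalized per worker, while per-GPU performance parameters are left untouched.
      ```python
      def per_worker_sizes(train_batch_size: int, ppo_mini_batch_size: int,
                           ppo_micro_batch_size_per_gpu: int, dp_world_size: int):
          # Global algorithmic sizes are divided across DP ranks inside each Worker.
          return {
              "train_batch_size_per_worker": train_batch_size // dp_world_size,
              "ppo_mini_batch_size_per_worker": ppo_mini_batch_size // dp_world_size,
              # Already a per-GPU value, so no normalization is applied.
              "ppo_micro_batch_size_per_gpu": ppo_micro_batch_size_per_gpu,
          }

      print(per_worker_sizes(1024, 256, 4, 8))
      # -> {'train_batch_size_per_worker': 128, 'ppo_mini_batch_size_per_worker': 32,
      #     'ppo_micro_batch_size_per_gpu': 4}
      ```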
    • docs: add reference for tiny-zero · c17e6c62
      HL committed