- 27 Jan, 2025 5 commits
-
-
We set `max_num_batched_tokens` in the `.rollout` config, but it wasn't actually being passed to vLLM, potentially leaving the GPUs underutilized. This PR:
- properly passes `max_num_batched_tokens` from the config to vLLM
- sets `disable_log_stats` to False, so that vLLM performance statistics are displayed (to help spot issues)
Xingyao Wang committed -
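A minimal sketch of how rollout config values like these could be forwarded to the vLLM engine; the `rollout_config` object and default value below are illustrative, not verl's actual wiring.

```python
# Sketch only: forward rollout config values into the vLLM engine.
from vllm import LLM

def build_rollout_engine(model_path: str, rollout_config: dict) -> LLM:
    return LLM(
        model=model_path,
        # Without this, vLLM falls back to its own default and may batch
        # fewer tokens per step than the hardware allows.
        max_num_batched_tokens=rollout_config.get("max_num_batched_tokens", 8192),
        # Keep engine statistics visible so throughput issues can be spotted.
        disable_log_stats=False,
    )
```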
## Summary
This PR renames all micro_batch_size fields to micro_batch_size_per_gpu.

**The core logic of setting batch size:**
- **All algorithmic parameters** (train batch size, PPO mini batch size) are global (from the perspective of the single controller) and are normalized in each Worker.
- **All performance-related parameters** (micro batch size, max token length in dynamic batch size) are local parameters that represent the data size per GPU (i.e., per Worker).

## Main Changes
1. Change the scripts and config, and delete the normalization for micro_bsz
2. Fix CI for SFT
Guangming Sheng committed -
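A minimal sketch of the convention described in the commit above, assuming a hypothetical dict-style config; the field names are illustrative.

```python
# Sketch: global algorithmic batch sizes are divided by the data-parallel
# world size inside each worker, while *_per_gpu sizes are used as-is.
def normalize_batch_sizes(config: dict, dp_world_size: int):
    # Algorithmic sizes are global and get normalized per worker.
    train_bsz_per_dp_rank = config["train_batch_size"] // dp_world_size
    mini_bsz_per_dp_rank = config["ppo_mini_batch_size"] // dp_world_size
    # Performance knobs are already per-GPU and are not divided again.
    micro_bsz_per_gpu = config["ppo_micro_batch_size_per_gpu"]
    return train_bsz_per_dp_rank, mini_bsz_per_dp_rank, micro_bsz_per_gpu
```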
HL committed
-
HL committed
-
- As titled
Guangming Sheng committed
-
- 26 Jan, 2025 2 commits
-
-
minor fix
Ikko Eltociear Ashimine committed -
Guangming Sheng committed
-
- 25 Jan, 2025 1 commit
-
-
This PR adds support for LoRA (Low-Rank Adaptation) for efficient model fine-tuning.

### Changes
1. Added LoRA configuration support in trainer config
2. Modified FSDP wrapping policy to handle LoRA modules
3. Integrated with existing FSDP training infrastructure
4. Added peft dependency
5. Removed unused ring_attn_utils.py

### Features
- Configurable LoRA rank and alpha parameters
- Target module specification for selective adaptation
- Compatible with FSDP sharding strategy

### Testing
Tested with the Qwen2.5-0.5B-Instruct model on the GSM8K dataset using the provided example script.

### Dependencies
- Added `peft` package to requirements.txt

This PR is based on commit 902ddbe6 and has been merged with the latest upstream main branch.

---------
Co-authored-by: Jiayi Pan <i@jiayipan.me>
Co-authored-by: openhands <openhands@all-hands.dev>
Xingyao Wang committed
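A minimal sketch of enabling LoRA with `peft` before FSDP wrapping; the rank, alpha, and target modules below are illustrative choices, not verl's defaults.

```python
# Sketch: attach LoRA adapters to a causal LM with peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
lora_config = LoraConfig(
    r=16,                                   # LoRA rank (assumed value)
    lora_alpha=32,                          # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],    # which linear layers to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```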
-
- 24 Jan, 2025 7 commits
-
-
HL committed
-
Chi Zhang committed
-
Chi Zhang committed
-
- Support training for several iterations in the SFT trainer - Add CI for the SFT trainer that trains one iteration.
Guangming Sheng committed -
- Force ref/rm to use CPUOffload. Fix the root FSDP unit not resharding weights after forward - HSDP support is on hold and asserts False for now.
Chi Zhang committed -
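A minimal sketch of wrapping a frozen reference/reward model with FSDP parameter CPU offload, as referenced above; it assumes `torch.distributed` is already initialized and omits the wrapping policy for brevity.

```python
# Sketch: keep the frozen model's parameters on CPU between forward passes.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

def wrap_frozen_model(model: torch.nn.Module) -> FSDP:
    # ref/RM models are inference-only, so the extra host-to-device copies
    # are an acceptable trade-off for the memory savings.
    return FSDP(
        model,
        cpu_offload=CPUOffload(offload_params=True),
        device_id=torch.cuda.current_device(),
    )
```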
This reverts commit 19840945.
shengguangming committed -
shengguangming committed
-
- 23 Jan, 2025 3 commits
-
-
This PR supports:
- meta device init (which keeps the shared parameters)
- parallel pre-trained weight init for FSDP from a Hugging Face checkpoint

---------
Co-authored-by: zhiqi.0 <zhiqi.0@bytedance.com>
Zhiqi Lin committed -
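A minimal sketch of the meta-device pattern named above: build the model without allocating real weights, then materialize and load them later. The model name is illustrative, and the loading step is only described in a comment.

```python
# Sketch: construct the model on the meta device (no memory allocated,
# tied/shared parameters stay shared), defer weight materialization.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)
# Real weights would then be materialized by FSDP (e.g. via param_init_fn)
# and loaded, possibly rank-parallel, from the Hugging Face checkpoint.
```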
- Implement KL loss, GRPO outcome advantage, and best-of-n (BoN) rollouts - Provide scripts for DeepSeek and Qwen on GSM8k; more can be provided for other datasets - Support sequence balancing - Training qwen2-7b, the GSM8k score can reach 0.89
Guangming Sheng committed -
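A minimal sketch of a GRPO-style outcome advantage as commonly defined: each prompt is rolled out n times (best-of-n style) and the scalar rewards are normalized within that group; verl's exact implementation may differ.

```python
# Sketch: group-normalized outcome rewards used as per-response advantages.
import torch

def grpo_outcome_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, n_rollouts) scalar outcome rewards, n_rollouts > 1."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Every token of a response shares the same group-normalized advantage.
    return (rewards - mean) / (std + eps)
```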
- The actual DP size when using SP is (DP // SP), since each group of SP GPUs holds the same sequence, just different parts of it
Guangming Sheng committed
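A tiny sketch of the arithmetic in the commit above, with illustrative numbers.

```python
# Sketch: SP ranks share one sequence, so the effective DP size shrinks.
world_size = 16   # illustrative total number of GPUs
sp_size = 4       # sequence-parallel group size
assert world_size % sp_size == 0
real_dp_size = world_size // sp_size   # 4 groups, each sees a different batch
```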
-
- 22 Jan, 2025 1 commit
-
-
- Without multiproc: Train 1/2: 1%|▍ | 20/3934 [01:38<5:14:50, 4.83s/it], avg GPU utilization: 55%
- With multiproc: Train 1/2: 1%|▍ | 20/3934 [01:00<2:57:09, 2.72s/it], avg GPU utilization: 95%
hoshi-hiyouga committed
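The commit message does not show the code, but one common way to get this kind of speedup is multi-process data loading; a hypothetical sketch (the commit itself may use a different mechanism):

```python
# Sketch: worker processes prepare batches while the GPU trains.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=8,     # spawn worker processes for preprocessing
    pin_memory=True,   # faster host-to-device copies
)
```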
-
- 21 Jan, 2025 3 commits
- 20 Jan, 2025 1 commit
-
-
HL committed
-
- 19 Jan, 2025 1 commit
-
-
Chi Zhang committed
-
- 18 Jan, 2025 4 commits
-
-
- Forbid uneven chunks for DataProto
Chi Zhang committed -
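A minimal sketch of the even-chunk check described above; `data` stands in for a DataProto-like batch container and is not verl's actual class.

```python
# Sketch: refuse to split a batch that does not divide evenly.
def chunk_evenly(data, num_chunks: int):
    assert len(data) % num_chunks == 0, (
        f"batch of size {len(data)} cannot be split evenly into {num_chunks} chunks"
    )
    chunk_size = len(data) // num_chunks
    return [data[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]
```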
- Set use_reentrant=False to avoid duplicate all-gather in backward when gradient checkpointing is enabled - Optimize temperature computation by using an in-place op - Fix testing logic
Chi Zhang committed -
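A minimal sketch of enabling non-reentrant gradient checkpointing on a Hugging Face model, as referenced above (the model name is illustrative; recent transformers versions accept the kwargs shown):

```python
# Sketch: non-reentrant checkpointing avoids the duplicate all-gather
# that the reentrant variant can trigger under FSDP.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
```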
Guangming Sheng committed
-
Chi Zhang committed
-
- 17 Jan, 2025 3 commits
-
-
- Add format script - Move save_checkpoint to a separate function - Add timing/step, response_length/clip_ratio, prompt_length/clip_ratio and critic/vf_explained_var metrics - The training step starts from 1
Chi Zhang committed -
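A minimal sketch of the critic/vf_explained_var metric named above, using the standard explained-variance definition (1.0 means the value function perfectly explains the returns); verl's exact formula may differ.

```python
# Sketch: how much of the return variance the value function explains.
import torch

def explained_variance(values: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    var_returns = returns.var()
    return 1.0 - (returns - values).var() / (var_returns + 1e-8)
```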
- As titled
Guangming Sheng committed -
Guangming Sheng committed
-
- 16 Jan, 2025 2 commits
- 14 Jan, 2025 3 commits
-
-
As title
Chi Zhang committed -
Guangming Sheng committed
-
hoshi-hiyouga committed
-
- 13 Jan, 2025 2 commits
-
-
* add ci
* fix reward model and write more ci script
* support different flash_attn version with variable num returns
* update transformers rmpad workflow
* balance workload
* lint
* lint
Guangming Sheng committed -
[misc] fix reward model issue with TokenClassification model and support running particular steps instead of epochs (#99)
* support user-specified training steps
* fix typo
* update ci
* add ci
* fix reward model and write more ci script
* update ci
* lint
* align
* delete post-training val
* fix script
Guangming Sheng committed
-
- 12 Jan, 2025 1 commit
-
-
Guangming Sheng committed
-
- 11 Jan, 2025 1 commit
-
-
* update lightning link * Update verl_getting_started.ipynb
HL committed
-