- 02 Mar, 2025 5 commits
Weizhe Chen committed

ZSL98 committed
Specify the IP address when calling the bind method.
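A minimal sketch of the intent with a plain Python TCP socket (illustrative only; the actual call site in verl may differ, and the host/port are placeholders):

```python
import socket

# Bind the listening socket to an explicit IP address rather than the
# wildcard address, so the service is reachable on a known interface in
# multi-host setups. (Hypothetical host/port values.)
HOST, PORT = "127.0.0.1", 8000

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind((HOST, PORT))  # rather than sock.bind(("", PORT))
sock.listen()
sock.close()
```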
Willem Jiang committed

Guangming Sheng committed
Now the APIs can be displayed in the documentation.
HL committed
- 01 Mar, 2025 2 commits
Lumeng Wu committed
Because of ongoing updates in vLLM, veRL currently cannot integrate directly with the vLLM nightly build: the nightly's new DP feature can no longer be bypassed by simply adjusting the `data_parallel_size` parameter, and resolving this requires further investigation. As a temporary workaround, I recommend a customized installation of vLLM if the V1 engine is required; I have updated the relevant documentation to reflect this guidance.
ZSL98 committed
- 28 Feb, 2025 3 commits
Validation should not have shuffling.
Shawn/Yuxuan Tong committed
This is an enhancement to the single-batch strategy for `val_dataloader`, making https://github.com/volcengine/verl/pull/353 more robust.
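A minimal sketch of the idea in plain PyTorch (illustrative, not verl's actual code): validation uses a single batch covering the whole dataset, with shuffling disabled.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in for the real validation set.
val_dataset = TensorDataset(torch.arange(100).unsqueeze(1))

val_dataloader = DataLoader(
    val_dataset,
    batch_size=len(val_dataset),  # one batch for the entire dataset
    shuffle=False,                # validation should not be shuffled
)

assert len(val_dataloader) == 1  # the whole set arrives as a single batch
```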
Shawn/Yuxuan Tong committed

Willem Jiang committed
- 27 Feb, 2025 6 commits
Add TensorBoard to the Tracking backends. The user can set the environment variable TENSORBOARD_DIR to specify the TensorBoard log path.
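A hedged sketch of how the backend could be wired up (names are illustrative, not verl's actual Tracking implementation):

```python
import os
from torch.utils.tensorboard import SummaryWriter

# Honor TENSORBOARD_DIR when set; otherwise fall back to a default path.
log_dir = os.environ.get("TENSORBOARD_DIR", "./tensorboard_log")
writer = SummaryWriter(log_dir=log_dir)

# Example: log one scalar metric per training step.
writer.add_scalar("train/reward_mean", 0.42, global_step=1)
writer.close()
```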
Hongji Zhu committed

Chi Zhang committed
The current training script uses the same data file for both training and evaluation, which is suspected to be incorrect.
yaguang committed
[ckpt] Replace DataLoader with StatefulDataLoader to support resuming training with SequentialSampler (#389). This attempts to resolve [this issue](https://github.com/volcengine/verl/issues/356). As suggested in the issue discussion, the default DataLoader is replaced with StatefulDataLoader, which provides `state_dict` and `load_state_dict` methods that support resuming the iterator position from a mid-epoch checkpoint.
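A minimal sketch of the resume pattern with torchdata's StatefulDataLoader (requires the torchdata package; verl's checkpoint plumbing is more involved):

```python
import torch
from torch.utils.data import TensorDataset
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = TensorDataset(torch.arange(10).unsqueeze(1))
loader = StatefulDataLoader(dataset, batch_size=2)

it = iter(loader)
next(it)                     # consume one batch mid-epoch
state = loader.state_dict()  # capture the iterator position (checkpoint)

# After a restart: restore the state and continue from the same position.
resumed = StatefulDataLoader(dataset, batch_size=2)
resumed.load_state_dict(state)
for batch in resumed:        # iteration resumes after the consumed batch
    print(batch)
```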
alexchiu committed
Thanks: @HillZhang1999. Related issue: https://github.com/volcengine/verl/issues/189

`(main_task pid=3523385) ValueError: max_num_batched_tokens (8192) is smaller than max_model_len (9216). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len.`

When `enable_chunked_prefill` is activated, the error above is concealed. Please increase `max_num_batched_tokens` or decrease `max_model_len`.
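A hedged illustration of the constraint via vLLM's offline `LLM` entry point (the model name is a placeholder; the values match the error above):

```python
from vllm import LLM

# With chunked prefill disabled, vLLM requires
# max_num_batched_tokens >= max_model_len; otherwise it raises the
# ValueError quoted above.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    max_model_len=9216,
    max_num_batched_tokens=9216,         # raised from 8192 to match
    enable_chunked_prefill=False,
)
```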
Guangming Sheng committed

Chi Zhang committed
- 26 Feb, 2025 2 commits
apis: add DataProto to the documentation page; use copy_to_local instead of copy_local_path_from_hdfs (#358)
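A small hedged usage sketch of the renamed helper (assuming it lives under `verl.utils.fs`; the remote path is a placeholder):

```python
from verl.utils.fs import copy_to_local

# Copy a possibly remote (e.g. HDFS) path into a local cache and return
# the local path; replaces the older copy_local_path_from_hdfs.
local_path = copy_to_local("hdfs://some/remote/model/path")  # placeholder
print(local_path)
```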
HL committed
As titled.
Guangming Sheng committed
- 25 Feb, 2025 4 commits
See issue: https://github.com/volcengine/verl/issues/342
Mingjie Liu committed
#369
Co-authored-by: Thom <zhangyi@zhangyideMacBook-Pro.local>
_T_L_R_ committed

kriswang committed

Chi Zhang committed
- 24 Feb, 2025 5 commits
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
湛露先生 committed

BearBiscuit committed
Close #312: add support for Ulysses SP for transformers >= 4.48. I've tested transformers 4.45.0, 4.46.0, 4.47.0, 4.48.0, and 4.49.0 using sp=2 with the following script in my local env:

```bash
#!/bin/bash
set -ex

VERSIONS=("4.45.0" "4.46.0" "4.47.0" "4.48.0" "4.49.0")

for version in "${VERSIONS[@]}"; do
    echo "Testing with Transformers version ${version}"
    echo "----------------------------------------"
    pip install "transformers==${version}"
    PYTHONPATH=./ torchrun --nproc_per_node=2 tests/model/test_transformers_ulysses.py
    echo "----------------------------------------"
    echo "Completed testing for version ${version}"
    echo ""
done
```
zhou fan committed
Fix issue [#331](https://github.com/volcengine/verl/issues/331).
BearBiscuit committed
Validation datasets are sent to inference engines as a whole batch, and the engines schedule memory themselves; a sketch of the deprecation check is below.
- [x] Remove `val_batch_size` from examples
- [x] Set default values of `val_batch_size` in configs to `null` and add DEPRECATED comments
- [x] Add deprecation warnings about `val_batch_size` in `_validate_config`
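A hedged sketch of such a deprecation check (the config access pattern is illustrative, not verl's actual `_validate_config`):

```python
import warnings

def validate_config(config: dict) -> None:
    # val_batch_size is deprecated: validation now sends the whole
    # dataset to the inference engine as a single batch.
    if config.get("data", {}).get("val_batch_size") is not None:
        warnings.warn(
            "data.val_batch_size is deprecated and ignored; validation "
            "datasets are sent to the inference engine as a whole batch.",
            DeprecationWarning,
        )

validate_config({"data": {"val_batch_size": 32}})  # emits the warning
```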
Shawn/Yuxuan Tong committed
- 23 Feb, 2025 2 commits
Tracking backend: support vemlp wandb.
Co-authored-by: liudayuan.carrot <liudayuan.carrot@bytedance.com>
liudayuan-carrot committed

Ikko Eltociear Ashimine committed
- 22 Feb, 2025 2 commits
HL committed
Implement the RLOO algorithm according to https://arxiv.org/abs/2402.14740.
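A standalone hedged sketch of RLOO's leave-one-out baseline (not verl's implementation): with k sampled responses per prompt, each sample's advantage is its reward minus the mean reward of the other k-1 samples.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (k,), one scalar reward per sampled response for
    the same prompt. Returns the leave-one-out advantages."""
    k = rewards.numel()
    baseline = (rewards.sum() - rewards) / (k - 1)  # mean of the other k-1
    return rewards - baseline

# Example: four sampled responses for one prompt.
print(rloo_advantages(torch.tensor([1.0, 0.0, 0.5, 0.5])))
```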
Zefan Wang committed
- 21 Feb, 2025 2 commits
HL committed
This PR adds Ray Serve to the requirements to enable support for multi-node training. It addresses the issue described here: https://github.com/volcengine/verl/issues/87#issuecomment-2659493418
Co-authored-by: Yu Feng <fengyufengyu@didiglobal.com>
Yu Feng committed
- 20 Feb, 2025 1 commit
HL committed
- 19 Feb, 2025 6 commits
Support Qwen2 Megatron backend. The code is primarily adapted from the llama folder, with modifications to use QKV bias and to remove the rope_scaling of RoPE in `verl/models/qwen2/megatron/layers/parallel_attention.py`.
- Training Qwen2-7B-Instruct with PPO, the GSM8k score can reach 0.87 at step 75.
- The saver is not supported yet.
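A minimal hedged sketch of the key architectural difference (illustrative dimensions, plain MHA shapes ignoring GQA; not the actual Megatron parallel layers):

```python
import torch.nn as nn

hidden_size = 3584  # illustrative Qwen2-7B-like width

# Qwen2 applies a bias on the fused QKV projection, unlike Llama, and its
# RoPE is used without rope_scaling; the rest mirrors the llama layout.
qkv_proj_qwen2 = nn.Linear(hidden_size, 3 * hidden_size, bias=True)
qkv_proj_llama = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
```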
Kinman Lei committed

Chi Zhang committed
Willem Jiang committed
A working Slurm example adapted from https://docs.ray.io/en/latest/ray-core/starting-ray.html
Chenhui Zhang committed

HL committed
Willem Jiang committed