Unverified Commit ced8ecbf by HL Committed by GitHub

example: switch the default model ckpt for Megatron, add wandb logs (#210)

Use a general-purpose LLM for the math task instead of a code LLM.

---------

Co-authored-by: Your Name <you@example.com>
parent 22d56a8b
......@@ -69,7 +69,7 @@ Checkout this [Jupyter Notebook](https://github.com/volcengine/verl/tree/main/ex
- [Run GSM8K Example](https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html)
**Reproducible algorithm baselines:**
- [PPO](https://verl.readthedocs.io/en/latest/experiment/ppo.html)
- [PPO and GRPO](https://verl.readthedocs.io/en/latest/experiment/ppo.html)
**For code explanation and advanced usage (extension):**
- PPO Trainer and Workers
......
# veRL documents
# verl documents
## Build the docs
......
......@@ -3,7 +3,7 @@ Extend to other RL(HF) algorithms
We have already implemented the complete training pipeline of the PPO
algorithm. To extend to other algorithms, we analyze the high-level
principle to use veRL and provide a tutorial to implement the DPO
principles of using verl and provide a tutorial on implementing the DPO
algorithm. Users can follow a similar paradigm to extend to other RL algorithms.
.. note:: **Key ideas**: Single process drives multi-process computation and data communication.
......@@ -26,7 +26,7 @@ Step 3: Utilize the encapsulated APIs to implement the control flow
Example: Online DPO
-------------------
We use veRL to implement a simple online DPO algorithm. The algorithm
We use verl to implement a simple online DPO algorithm. The algorithm
flow of Online DPO is as follows:
1. There is a prompt (rollout) generator which has the same weight as
......@@ -178,7 +178,7 @@ steps:
and merge them.
Frequently calling these 3 steps on the controller process greatly hurts
code readability. **In veRL, we have abstracted and encapsulated these 3
code readability. **In verl, we have abstracted and encapsulated these 3
steps, so that the worker's method + dispatch + collect can be
registered into the worker_group**
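As a rough sketch of that registration pattern, the example below assumes verl's ``register`` decorator and ``Dispatch`` modes; the worker class, method body and chosen dispatch mode are illustrative, not a verbatim excerpt from the codebase.

.. code:: python

   from verl.single_controller.base import Worker
   from verl.single_controller.base.decorator import register, Dispatch

   class ActorWorker(Worker):

       @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
       def update_actor(self, data):
           # ``data`` arrives already dispatched (sharded) to this worker; the
           # decorator also collects and merges the per-worker outputs back on
           # the controller process.
           ...

The controller then issues a single ``worker_group.update_actor(batch)`` call, and dispatch, computation and collection all happen behind it.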
......
......@@ -31,7 +31,7 @@
# -- Project information -----------------------------------------------------
project = u'veRL'
project = u'verl'
# pylint: disable=W0622
copyright = u'2024 ByteDance Seed Foundation MLSys Team'
author = u'Guangming Sheng, Chi Zhang, Yanghua Peng, Haibin Lin'
......
......@@ -200,7 +200,7 @@ Define, init and run the PPO Trainer
on the allocated GPUs (in the resource pool)
- The actual PPO training will be executed in ``trainer.fit()``
veRL can be easily extended to other RL algorithms by reusing the Ray
verl can be easily extended to other RL algorithms by reusing the Ray
model workers, resource pool and reward functions. See :doc:`extension<../advance/dpo_extension>` for
more information.
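Based on the description above, the driver side can be condensed into a short sketch; the argument list is abridged and illustrative rather than the full signature used in the repository.

.. code:: python

   from verl.trainer.ppo.ray_trainer import RayPPOTrainer  # import path assumed

   trainer = RayPPOTrainer(
       config=config,                                # hydra config: data, actor, critic, trainer, ...
       tokenizer=tokenizer,
       role_worker_mapping=role_worker_mapping,      # maps each role to its Ray worker class
       resource_pool_manager=resource_pool_manager,  # allocates GPUs from the resource pool
       reward_fn=reward_fn,
       val_reward_fn=val_reward_fn,
   )
   trainer.init_workers()  # place the model workers on the allocated GPUs
   trainer.fit()           # the actual PPO training loop runs here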
......
......@@ -11,22 +11,32 @@ Assuming GSM8k dataset is preprocess via ``python3 examples/data_preprocess/gsm8
Refer to the table below to reproduce PPO training from different pre-trained models.
.. _Huggingface: https://huggingface.co/google/gemma-2-2b-it#benchmark-results
.. _SFT Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _SFT Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _wandb: https://api.wandb.ai/links/verl-team/h7ux8602
.. _Qwen Blog: https://qwenlm.github.io/blog/qwen2.5-llm/
.. _PPO Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+============================+========================+============+===============================================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and logs`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and logs`_, `wandb`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and logs`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
\ No newline at end of file
.. _PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log
.. _Megatron PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/deepseek-llm-7b-chat-megatron-bsz256_4-prompt512-resp512-0.695.log
.. _Qwen7b GRPO Script: https://github.com/volcengine/verl/blob/a65c9157bc0b85b64cd753de19f94e80a11bd871/examples/grpo_trainer/run_qwen2-7b_seq_balance.sh
.. _Megatron wandb: https://wandb.ai/verl-team/verl_megatron_gsm8k_examples/runs/10fetyr3
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+==================================+========================+============+===============================================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and Logs`_, `wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| deepseek-ai/deepseek-llm-7b-chat | PPO | 69.5 [1]_ | `Megatron PPO Command and Logs`_, `Megatron wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2-7B-Instruct | GRPO | 89 | `Qwen7b GRPO Script`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
.. [1] During evaluation, we only extract answers that follow the "####" format. A more flexible answer extraction, a longer response length, and better prompt engineering may lead to a higher score.
\ No newline at end of file
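For reference, a minimal sketch of the strict ``####``-based extraction that footnote [1] describes could look like the following; it is illustrative only, not verl's actual reward function.

.. code:: python

   import re

   def extract_gsm8k_answer(response: str):
       """Return the number after the last '####' marker, or None if absent."""
       matches = re.findall(r"####\s*(-?\d[\d,]*(?:\.\d+)?)", response)
       if not matches:
           return None
       return matches[-1].replace(",", "")

   print(extract_gsm8k_answer("... so the total is 42. #### 42"))  # '42'
   print(extract_gsm8k_answer("the answer is 42"))                 # None -> scored as incorrect

Under this rule, a response whose final answer is correct but never emits ``####`` is still scored as wrong, which is why a more flexible extractor may raise the reported number.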
Welcome to veRL's documentation!
Welcome to verl's documentation!
================================================
.. _hf_arxiv: https://arxiv.org/pdf/2409.19256
veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <hf_arxiv>`_ paper.
verl is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <hf_arxiv>`_ paper.
veRL is flexible and easy to use with:
verl is flexible and easy to use with:
- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex post-training dataflows, allowing users to build RL dataflows in just a few lines of code.
......@@ -16,9 +16,9 @@ veRL is flexible and easy to use with:
- Ready integration with popular HuggingFace models
veRL is fast with:
verl is fast with:
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, veRL achieves high generation and training throughput.
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, verl achieves high generation and training throughput.
- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
......@@ -92,7 +92,7 @@ veRL is fast with:
Contribution
-------------
veRL is free software; you can redistribute it and/or modify it under the terms
verl is free software; you can redistribute it and/or modify it under the terms
of the Apache License 2.0. We welcome contributions.
Join us on `GitHub <https://github.com/volcengine/verl>`_, `Slack <https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA>`_ and `Wechat <https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG>`_ for discussions.
......
Performance Tuning Guide
=========================
In this section, we will discuss how to tune the performance of all the stages in veRL, including:
In this section, we will discuss how to tune the performance of all the stages in verl, including:
1. Rollout generation throughput.
......@@ -16,7 +16,7 @@ In this section, we will discuss how to tune the performance of all the stages i
Rollout Generation Tuning
--------------------------
veRL currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).
verl currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).
Below are key factors for tuning vLLM-based rollout. Before tuning, we recommend setting ``actor_rollout_ref.rollout.disable_log_stats=False`` so that rollout statistics are logged.
......@@ -45,7 +45,7 @@ Batch Size Tuning
To achieve higher throughput in experience preparation (i.e., model fwd) and model update (i.e., actor/critic fwd/bwd),
users may need to tune ``*micro_batch_size_per_gpu`` for the different computations.
In veRL, the core principle for setting batch sizes is:
In verl, the core principle for setting batch sizes is:
- **Algorithmic metrics** (train batch size, PPO mini-batch size) are *global* (from a single-controller perspective),
normalized in each worker. See the `normalization code <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py#L120-L122>`_ and the numeric sketch below.
......
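A small numeric sketch of that normalization, using assumed values (the variable names are illustrative, not verl's exact config keys):

.. code:: python

   # Global (algorithmic) batch sizes, as seen from the single controller.
   train_batch_size    = 1024
   ppo_mini_batch_size = 256

   # Assume 8 data-parallel workers (one per GPU).
   dp_world_size = 8

   # Each worker normalizes the global sizes by the data-parallel world size.
   train_batch_size_per_gpu = train_batch_size // dp_world_size     # 128
   mini_batch_size_per_gpu  = ppo_mini_batch_size // dp_world_size  # 32

   # *micro_batch_size_per_gpu is a purely local performance knob: it only
   # controls how many gradient-accumulation steps make up one mini-batch.
   micro_batch_size_per_gpu = 4
   grad_accum_steps = mini_batch_size_per_gpu // micro_batch_size_per_gpu  # 8

   print(train_batch_size_per_gpu, mini_batch_size_per_gpu, grad_accum_steps)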
......@@ -7,7 +7,7 @@ Requirements
- **Python**: Version >= 3.9
- **CUDA**: Version >= 12.1
veRL supports various backends. Currently, the following configurations are available:
verl supports various backends. Currently, the following configurations are available:
- **FSDP** and **Megatron-LM** (optional) for training.
- **vLLM** and **TGI** for rollout generation, with **SGLang** support coming soon.
......@@ -34,7 +34,7 @@ Image and tag: ``verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3`
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN <image:tag>
2. Inside the container, install veRL:
2. Inside the container, install verl:
.. code:: bash
......@@ -74,7 +74,7 @@ To manage environment, we recommend using conda:
conda create -n verl python==3.9
conda activate verl
For installing the latest version of veRL, the best way is to clone and
For installing the latest version of verl, the best way is to clone and
install it from source. Then you can modify our code to customize your
own post-training jobs.
......@@ -85,7 +85,7 @@ own post-training jobs.
cd verl
pip3 install -e .
You can also install veRL using ``pip3 install``
You can also install verl using ``pip3 install``
.. code:: bash
......@@ -95,9 +95,9 @@ You can also install veRL using ``pip3 install``
Dependencies
------------
veRL requires Python >= 3.9 and CUDA >= 12.1.
verl requires Python >= 3.9 and CUDA >= 12.1.
veRL support various backend, we currently release FSDP and Megatron-LM
verl supports various backends. We currently release FSDP and Megatron-LM
for actor training and vLLM for rollout generation.
The following dependencies are required for all backends (PyTorch FSDP and Megatron-LM).
......
set -x
# prepare pre-trained model ckpt
huggingface-cli download deepseek-ai/deepseek-llm-7b-chat --local-dir $HOME/models/deepseek-llm-7b-chat
# ``actor_rollout_ref.rollout.tensor_model_parallel_size`` in theory could be different from
# ``**.megatron.tensor_model_parallel_size``
# the config file used: verl/trainer/main_ppo/config/ppo_megatron_trainer.yaml
python3 -m verl.trainer.main_ppo --config-path=config \
......@@ -10,19 +16,22 @@ python3 -m verl.trainer.main_ppo --config-path=config \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
actor_rollout_ref.model.path=$HOME/models/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
critic.optim.lr=2e-5 \
critic.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
critic.model.path=$HOME/models/deepseek-llm-7b-chat \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.megatron.tensor_model_parallel_size=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
......
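Because this run enables wandb via ``trainer.logger=['console','wandb']``, credentials must be available before launch. A hypothetical launcher sketch (the API key value and the script file name are placeholders, not values from this repository):

.. code:: python

   import os
   import subprocess

   env = os.environ.copy()
   env["WANDB_API_KEY"] = "<your-wandb-api-key>"  # or run `wandb login` once beforehand

   # Launch the Megatron PPO script shown above (file name is a placeholder).
   subprocess.run(["bash", "run_deepseek_megatron_ppo.sh"], env=env, check=True)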
......@@ -8,13 +8,13 @@
"source": [
"# Run Qwen PPO with [verl](https://github.com/volcengine/verl)\n",
"\n",
"This tutorial provides a step-by-step guide to using veRL for executing your RLHF pipeline. You can find our [github repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for mode details.\n",
"This tutorial provides a step-by-step guide to using verl for executing your RLHF pipeline. You can find our [github repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for mode details.\n",
"\n",
"This notebook is also published on the [Lightning Studio](https://lightning.ai/hlin-verl/studios/verl-getting-started) platform, which provides free GPU quota every month. Checkout the published notebook with pre-installed dependencies using a free L4 GPU [here](https://lightning.ai/hlin-verl/studios/verl-getting-started) (no credit card required).\n",
"\n",
"### You will learn:\n",
"\n",
"- How to install veRL from scratch.\n",
"- How to install verl from scratch.\n",
"- How to use existing scripts to run an RLHF pipeline with your own models and data."
]
},
......
......@@ -18,7 +18,7 @@ name = "verl"
# The actual version is specified in the [tool.setuptools.dynamic] section below.
dynamic = ["version"]
description = "veRL: Volcano Engine Reinforcement Learning for LLM"
description = "verl: Volcano Engine Reinforcement Learning for LLM"
license = {file = "LICENSE"} # or "Apache-2.0", if you prefer an SPDX identifier
readme = {file = "README.md", content-type = "text/markdown"}
requires-python = ">=3.8"
......
......@@ -43,7 +43,7 @@ setup(
license='Apache 2.0',
author='Bytedance - Seed - MLSys',
author_email='zhangchi.usc1992@bytedance.com, gmsheng@connect.hku.hk',
description='veRL: Volcano Engine Reinforcement Learning for LLM',
description='verl: Volcano Engine Reinforcement Learning for LLM',
install_requires=install_requires,
extras_require=extras_require,
package_data={'': ['version/*'],
......
......@@ -206,7 +206,7 @@ def initialize_model_parallel(
backend = backend or torch.distributed.get_backend()
# NOTE(sgm) we don't assert world_size == tp * pp
# DP is not managed by vllm but by the veRL WorkerGroup
# DP is not managed by vllm but by the verl WorkerGroup
num_tensor_model_parallel_groups: int = (world_size // tensor_model_parallel_size)
num_pipeline_model_parallel_groups: int = (world_size // pipeline_model_parallel_size)
......
......@@ -224,7 +224,7 @@ def initialize_model_parallel(
backend = backend or torch.distributed.get_backend(ps.get_world_group().device_group)
# NOTE(sgm) we don't assert world_size == tp * pp
# DP is not managed by vllm but by the veRL WorkerGroup
# DP is not managed by vllm but by the verl WorkerGroup
# if (world_size !=
# tensor_model_parallel_size * pipeline_model_parallel_size):
# raise RuntimeError(
......