Unverified Commit ced8ecbf by HL Committed by GitHub

example: switch the default model ckpt for Megatron, add wandb logs (#210)

Use a general-purpose LLM for the math task instead of a code LLM.

---------

Co-authored-by: Your Name <you@example.com>
parent 22d56a8b
......@@ -69,7 +69,7 @@ Checkout this [Jupyter Notebook](https://github.com/volcengine/verl/tree/main/ex
- [Run GSM8K Example](https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html)
**Reproducible algorithm baselines:**
- [PPO](https://verl.readthedocs.io/en/latest/experiment/ppo.html)
- [PPO and GRPO](https://verl.readthedocs.io/en/latest/experiment/ppo.html)
**For code explanation and advanced usage (extension):**
- PPO Trainer and Workers
......
# veRL documents
# verl documents
## Build the docs
......
......@@ -3,7 +3,7 @@ Extend to other RL(HF) algorithms
We have already implemented the complete training pipeline of the PPO
algorithm. To extend to other algorithms, we analyze the high-level
principle to use veRL and provide a tutorial to implement the DPO
principles of using verl and provide a tutorial on implementing the DPO
algorithm. Users can follow a similar paradigm to extend to other RL algorithms.
.. note:: **Key ideas**: Single process drives multi-process computation and data communication.
......@@ -26,7 +26,7 @@ Step 3: Utilize the encapsulated APIs to implement the control flow
Example: Online DPO
-------------------
We use veRL to implement a simple online DPO algorithm. The algorithm
We use verl to implement a simple online DPO algorithm. The algorithm
flow of Online DPO is as follows:
1. There is a prompt (rollout) generator which has the same weight as
......@@ -178,7 +178,7 @@ steps:
and merge them.
Frequently calling these 3 steps on the controller process greatly hurts
code readability. **In veRL, we have abstracted and encapsulated these 3
code readability. **In verl, we have abstracted and encapsulated these 3
steps, so that the worker's method + dispatch + collect can be
registered into the worker_group**
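As a rough sketch of that registration pattern, the example below assumes verl's ``register`` decorator and ``Dispatch`` modes; the worker class, method body and chosen dispatch mode are illustrative, not a verbatim excerpt from the codebase.

.. code:: python

   from verl.single_controller.base import Worker
   from verl.single_controller.base.decorator import register, Dispatch

   class ActorWorker(Worker):

       @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
       def update_actor(self, data):
           # ``data`` arrives already dispatched (sharded) to this worker; the
           # decorator also collects and merges the per-worker outputs back on
           # the controller process.
           ...

The controller then issues a single ``worker_group.update_actor(batch)`` call, and dispatch, computation and collection all happen behind it.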
......
......@@ -31,7 +31,7 @@
# -- Project information -----------------------------------------------------
project = u'veRL'
project = u'verl'
# pylint: disable=W0622
copyright = u'2024 ByteDance Seed Foundation MLSys Team'
author = u'Guangming Sheng, Chi Zhang, Yanghua Peng, Haibin Lin'
......
......@@ -200,7 +200,7 @@ Define, init and run the PPO Trainer
on the allocated GPUs (in the resource pool)
- The actual PPO training will be executed in ``trainer.fit()``
veRL can be easily extended to other RL algorithms by reusing the Ray
verl can be easily extended to other RL algorithms by reusing the Ray
model workers, resource pool and reward functions. See :doc:`extension<../advance/dpo_extension>` for
more information.
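Based on the description above, the driver side can be condensed into a short sketch; the argument list is abridged and illustrative rather than the full signature used in the repository.

.. code:: python

   from verl.trainer.ppo.ray_trainer import RayPPOTrainer  # import path assumed

   trainer = RayPPOTrainer(
       config=config,                                # hydra config: data, actor, critic, trainer, ...
       tokenizer=tokenizer,
       role_worker_mapping=role_worker_mapping,      # maps each role to its Ray worker class
       resource_pool_manager=resource_pool_manager,  # allocates GPUs from the resource pool
       reward_fn=reward_fn,
       val_reward_fn=val_reward_fn,
   )
   trainer.init_workers()  # place the model workers on the allocated GPUs
   trainer.fit()           # the actual PPO training loop runs here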
......
......@@ -11,22 +11,32 @@ Assuming GSM8k dataset is preprocess via ``python3 examples/data_preprocess/gsm8
Refer to the table below to reproduce PPO training from different pre-trained models.
.. _Huggingface: https://huggingface.co/google/gemma-2-2b-it#benchmark-results
.. _SFT Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _SFT Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log
.. _SFT+PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log
.. _wandb: https://api.wandb.ai/links/verl-team/h7ux8602
.. _Qwen Blog: https://qwenlm.github.io/blog/qwen2.5-llm/
.. _PPO Command and logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+============================+========================+============+===============================================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and logs`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and logs`_, `wandb`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and logs`_ |
+----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
\ No newline at end of file
.. _PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log
.. _Megatron PPO Command and Logs: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/deepseek-llm-7b-chat-megatron-bsz256_4-prompt512-resp512-0.695.log
.. _Qwen7b GRPO Script: https://github.com/volcengine/verl/blob/a65c9157bc0b85b64cd753de19f94e80a11bd871/examples/grpo_trainer/run_qwen2-7b_seq_balance.sh
.. _Megatron wandb: https://wandb.ai/verl-team/verl_megatron_gsm8k_examples/runs/10fetyr3
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Model | Method | Test score | Details |
+==================================+========================+============+===============================================================================================+
| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT | 52.06 | `SFT Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| google/gemma-2-2b-it | SFT + PPO | 64.02 | `SFT+PPO Command and Logs`_, `wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen Blog`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `PPO Command and Logs`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| deepseek-ai/deepseek-llm-7b-chat | PPO | 69.5 [1]_ | `Megatron PPO Command and Logs`_, `Megatron wandb`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
| Qwen/Qwen2-7B-Instruct | GRPO | 89 | `Qwen7b GRPO Script`_ |
+----------------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+
.. [1] During evaluation, we only extract answers that follow the "####" format. A more flexible answer extraction, a longer response length, and better prompt engineering may lead to a higher score.
\ No newline at end of file
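For reference, a minimal sketch of the strict ``####``-based extraction that footnote [1] describes could look like the following; it is illustrative only, not verl's actual reward function.

.. code:: python

   import re

   def extract_gsm8k_answer(response: str):
       """Return the number after the last '####' marker, or None if absent."""
       matches = re.findall(r"####\s*(-?\d[\d,]*(?:\.\d+)?)", response)
       if not matches:
           return None
       return matches[-1].replace(",", "")

   print(extract_gsm8k_answer("... so the total is 42. #### 42"))  # '42'
   print(extract_gsm8k_answer("the answer is 42"))                 # None -> scored as incorrect

Under this rule, a response whose final answer is correct but never emits ``####`` is still scored as wrong, which is why a more flexible extractor may raise the reported number.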
Welcome to veRL's documentation!
Welcome to verl's documentation!
================================================
.. _hf_arxiv: https://arxiv.org/pdf/2409.19256
veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <hf_arxiv>`_ paper.
verl is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow <hf_arxiv>`_ paper.
veRL is flexible and easy to use with:
verl is flexible and easy to use with:
- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex post-training dataflows, allowing users to build RL dataflows in just a few lines of code.
......@@ -16,9 +16,9 @@ veRL is flexible and easy to use with:
- Ready integration with popular HuggingFace models
veRL is fast with:
verl is fast with:
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, veRL achieves high generation and training throughput.
- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, verl achieves high generation and training throughput.
- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
......@@ -92,7 +92,7 @@ veRL is fast with:
Contribution
-------------
veRL is free software; you can redistribute it and/or modify it under the terms
verl is free software; you can redistribute it and/or modify it under the terms
of the Apache License 2.0. We welcome contributions.
Join us on `GitHub <https://github.com/volcengine/verl>`_, `Slack <https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA>`_ and `Wechat <https://raw.githubusercontent.com/eric-haibin-lin/verl-community/refs/heads/main/WeChat.JPG>`_ for discussions.
......
Performance Tuning Guide
=========================
In this section, we will discuss how to tune the performance of all the stages in veRL, including:
In this section, we will discuss how to tune the performance of all the stages in verl, including:
1. Rollout generation throughput.
......@@ -16,7 +16,7 @@ In this section, we will discuss how to tune the performance of all the stages i
Rollout Generation Tuning
--------------------------
veRL currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).
verl currently supports two rollout backends: vLLM and TGI (with SGLang support coming soon).
Below are key factors for tuning vLLM-based rollout. Before tuning, we recommend setting ``actor_rollout_ref.rollout.disable_log_stats=False`` so that rollout statistics are logged.
......@@ -45,7 +45,7 @@ Batch Size Tuning
To achieve higher throughput in experience preparation (i.e., model fwd) and model update (i.e., actor/critic fwd/bwd),
users may need to tune ``*micro_batch_size_per_gpu`` for the different computations.
In veRL, the core principle for setting batch sizes is:
In verl, the core principle for setting batch sizes is:
- **Algorithmic metrics** (train batch size, PPO mini-batch size) are *global* (from a single-controller perspective),
normalized in each worker. See the `normalization code <https://github.com/volcengine/verl/blob/main/verl/workers/fsdp_workers.py#L120-L122>`_ and the numeric sketch below.
......
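A small numeric sketch of that normalization, using assumed values (the variable names are illustrative, not verl's exact config keys):

.. code:: python

   # Global (algorithmic) batch sizes, as seen from the single controller.
   train_batch_size    = 1024
   ppo_mini_batch_size = 256

   # Assume 8 data-parallel workers (one per GPU).
   dp_world_size = 8

   # Each worker normalizes the global sizes by the data-parallel world size.
   train_batch_size_per_gpu = train_batch_size // dp_world_size     # 128
   mini_batch_size_per_gpu  = ppo_mini_batch_size // dp_world_size  # 32

   # *micro_batch_size_per_gpu is a purely local performance knob: it only
   # controls how many gradient-accumulation steps make up one mini-batch.
   micro_batch_size_per_gpu = 4
   grad_accum_steps = mini_batch_size_per_gpu // micro_batch_size_per_gpu  # 8

   print(train_batch_size_per_gpu, mini_batch_size_per_gpu, grad_accum_steps)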
......@@ -7,7 +7,7 @@ Requirements
- **Python**: Version >= 3.9
- **CUDA**: Version >= 12.1
veRL supports various backends. Currently, the following configurations are available:
verl supports various backends. Currently, the following configurations are available:
- **FSDP** and **Megatron-LM** (optional) for training.
- **vLLM** and **TGI** for rollout generation, with **SGLang** support coming soon.
......@@ -34,7 +34,7 @@ Image and tag: ``verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3`
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN <image:tag>
2. Inside the container, install veRL:
2. Inside the container, install verl:
.. code:: bash
......@@ -74,7 +74,7 @@ To manage environment, we recommend using conda:
conda create -n verl python==3.9
conda activate verl
For installing the latest version of veRL, the best way is to clone and
For installing the latest version of verl, the best way is to clone and
install it from source. Then you can modify our code to customize your
own post-training jobs.
......@@ -85,7 +85,7 @@ own post-training jobs.
cd verl
pip3 install -e .
You can also install veRL using ``pip3 install``
You can also install verl using ``pip3 install``
.. code:: bash
......@@ -95,9 +95,9 @@ You can also install veRL using ``pip3 install``
Dependencies
------------
veRL requires Python >= 3.9 and CUDA >= 12.1.
verl requires Python >= 3.9 and CUDA >= 12.1.
veRL support various backend, we currently release FSDP and Megatron-LM
verl supports various backends. We currently release FSDP and Megatron-LM
for actor training and vLLM for rollout generation.
The following dependencies are required for all backends (PyTorch FSDP and Megatron-LM).
......
set -x
# prepare pre-trained model ckpt
huggingface-cli download deepseek-ai/deepseek-llm-7b-chat --local-dir $HOME/models/deepseek-llm-7b-chat
# ``actor_rollout_ref.rollout.tensor_model_parallel_size`` in theory could be different from
# ``**.megatron.tensor_model_parallel_size``
# the config file used: verl/trainer/main_ppo/config/ppo_megatron_trainer.yaml
python3 -m verl.trainer.main_ppo --config-path=config \
......@@ -10,19 +16,22 @@ python3 -m verl.trainer.main_ppo --config-path=config \
data.val_batch_size=1312 \
data.max_prompt_length=512 \
data.max_response_length=512 \
actor_rollout_ref.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
actor_rollout_ref.model.path=$HOME/models/deepseek-llm-7b-chat \
actor_rollout_ref.actor.optim.lr=2e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
critic.optim.lr=2e-5 \
critic.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
critic.model.path=$HOME/models/deepseek-llm-7b-chat \
critic.model.enable_gradient_checkpointing=False \
critic.ppo_micro_batch_size_per_gpu=8 \
critic.ppo_micro_batch_size_per_gpu=4 \
critic.megatron.tensor_model_parallel_size=4 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
......
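Because this run enables wandb via ``trainer.logger=['console','wandb']``, credentials must be available before launch. A hypothetical launcher sketch (the API key value and the script file name are placeholders, not values from this repository):

.. code:: python

   import os
   import subprocess

   env = os.environ.copy()
   env["WANDB_API_KEY"] = "<your-wandb-api-key>"  # or run `wandb login` once beforehand

   # Launch the Megatron PPO script shown above (file name is a placeholder).
   subprocess.run(["bash", "run_deepseek_megatron_ppo.sh"], env=env, check=True)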
......@@ -8,13 +8,13 @@
"source": [
"# Run Qwen PPO with [verl](https://github.com/volcengine/verl)\n",
"\n",
"This tutorial provides a step-by-step guide to using veRL for executing your RLHF pipeline. You can find our [github repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for mode details.\n",
"This tutorial provides a step-by-step guide to using verl for executing your RLHF pipeline. You can find our [github repo](https://github.com/volcengine/verl/) and [documentation](https://verl.readthedocs.io/en/latest/index.html) for mode details.\n",
"\n",
"This notebook is also published on the [Lightning Studio](https://lightning.ai/hlin-verl/studios/verl-getting-started) platform, which provides free GPU quota every month. Checkout the published notebook with pre-installed dependencies using a free L4 GPU [here](https://lightning.ai/hlin-verl/studios/verl-getting-started) (no credit card required).\n",
"\n",
"### You will learn:\n",
"\n",
"- How to install veRL from scratch.\n",
"- How to install verl from scratch.\n",
"- How to use existing scripts to run an RLHF pipeline with your own models and data."
]
},
......
......@@ -18,7 +18,7 @@ name = "verl"
# The actual version is specified in the [tool.setuptools.dynamic] section below.
dynamic = ["version"]
description = "veRL: Volcano Engine Reinforcement Learning for LLM"
description = "verl: Volcano Engine Reinforcement Learning for LLM"
license = {file = "LICENSE"} # or "Apache-2.0", if you prefer an SPDX identifier
readme = {file = "README.md", content-type = "text/markdown"}
requires-python = ">=3.8"
......
......@@ -43,7 +43,7 @@ setup(
license='Apache 2.0',
author='Bytedance - Seed - MLSys',
author_email='zhangchi.usc1992@bytedance.com, gmsheng@connect.hku.hk',
description='veRL: Volcano Engine Reinforcement Learning for LLM',
description='verl: Volcano Engine Reinforcement Learning for LLM',
install_requires=install_requires,
extras_require=extras_require,
package_data={'': ['version/*'],
......
......@@ -206,7 +206,7 @@ def initialize_model_parallel(
backend = backend or torch.distributed.get_backend()
# NOTE(sgm) we don't assert world_size == tp * pp
# DP is not managed by vllm but by the veRL WorkerGroup
# DP is not managed by vllm but by the verl WorkerGroup
num_tensor_model_parallel_groups: int = (world_size // tensor_model_parallel_size)
num_pipeline_model_parallel_groups: int = (world_size // pipeline_model_parallel_size)
......
......@@ -224,7 +224,7 @@ def initialize_model_parallel(
backend = backend or torch.distributed.get_backend(ps.get_world_group().device_group)
# NOTE(sgm) we don't assert world_size == tp * pp
# DP is not managed by vllm but by the veRL WorkerGroup
# DP is not managed by vllm but by the verl WorkerGroup
# if (world_size !=
# tensor_model_parallel_size * pipeline_model_parallel_size):
# raise RuntimeError(
......