[init] feat: upload first open source version of verl

Commit 30911f13 authored Oct 31, 2024 by shengguangming
201 changed files
--- a/.gitignore
+++ b/.gitignore
+**/*.pt
+**/checkpoints
+**/wget-log
+**/_build/
+**/*.ckpt
+**/outputs
+**/*.tar.gz
+**/playground
+**/wandb
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+dataset/*
+tensorflow/my_graph/*
+.idea/
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*,cover
+.hypothesis/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# IPython Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# dotenv
+.env
+
+# virtualenv
+venv/
+ENV/
+
+# Spyder project settings
+.spyderproject
+
+# Rope project settings
+.ropeproject
+
+# vscode
+.vscode
+
+# Mac
+.DS_Store
+
+# output logs
+tests/e2e/toy_examples/deepspeed/synchronous/output.txt
--- a/.readthedocs.yaml
+++ b/.readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+version: 2
+
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.8"
+
+sphinx:
+  configuration: docs/conf.py
+
+python:
+  install:
+    - requirements: docs/requirements-docs.txt
\ No newline at end of file
--- a/.style.yapf
+++ b/.style.yapf
+[style]
+based_on_style = google
+column_limit = 120
+indent_width = 4
+split_arguments_when_comma_terminated: true
\ No newline at end of file
--- a/LICENSE
+++ b/LICENSE
--- a/Notice.txt
+++ b/Notice.txt
+Copyright 2023-2024 Bytedance Ltd. and/or its affiliates 
\ No newline at end of file
--- a/README.md
+++ b/README.md
+<div align=center>
+  <img src="docs/_static/logo.png" width = "20%" height = "20%" />
+</div>
+
+<h1 style="text-align: center;">veRL: Volcano Engine Reinforcement Learning for LLM</h1>
+
+veRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for large language models (LLMs). veRL is the open-source implementation of the [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper.
+
+veRL is flexible and easy to use with:
+
+- **Easy to support diverse RL(HF) algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex post-training dataflows, allowing users to build RL dataflows in a few lines of code.
+
+- **Seamless integration of existing LLM infra with modular API design**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.
+
+- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.
+
+- Ready integration with popular Hugging Face models
+
+
+veRL is fast with:
+
+- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, veRL achieves high generation and training throughput.
+
+- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
+
+
+<p align="center">
+| <a href="https://verl-doc.readthedocs.io/en/latest/index.html"><b>Documentation</b></a> | <a href="https://arxiv.org/abs/2409.19256v2"><b>Paper</b></a> | 
+<!-- <a href=""><b>Slides</b></a> | -->
+</p>
+
+
+
+## Installation
+
+To install the latest version of veRL, we recommend cloning the repository and installing it from source. You can then modify the code to customize your own post-training jobs.
+
+```bash
+# install verl together with some lightweight dependencies in setup.py
+git clone https://github.com/volcengine/verl.git
+cd verl
+pip3 install -e .
+```
+
+You can also install veRL using `pip3 install`
+
+```bash
+# directly install from pypi
+pip3 install verl
+```
+
+### Dependencies
+
+veRL requires Python >= 3.9 and CUDA >= 12.1.
+
+veRL supports various backends. We currently release FSDP and Megatron-LM for actor training and vLLM for rollout generation.
+
+To install the dependencies, we recommend using conda:
+
+```bash
+conda create -n verl python==3.9
+conda activate verl
+```
+
+The following dependencies are required for all backends.
+
+```bash
+# install torch [or you can skip this step and let vllm install the correct version for you]
+pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
+
+# install vllm
+pip3 install vllm==0.5.4
+pip3 install ray==2.10 # other versions may have bugs
+
+# flash attention 2
+pip3 install flash-attn --no-build-isolation
+```
+
+**FSDP**
+
+We recommend using FSDP backend to investigate, research and prototype different models, datasets and RL algorithms.
+
+The pros, cons and extension guide for using the FSDP backend can be found in fsdp.md.
+
+**Megatron-LM**
+
+For users who pursue better scalability, we recommend using Megatron-LM backend. Please install the above dependencies first.
+
+The pros, cons and extension guide for using the Megatron-LM backend can be found in megatron.md.
+
+Currently, we support Megatron-LM@core_v0.4.0, with a patch that fixes some internal issues of Megatron-LM. The additional installation steps are:
+
+```bash
+# FOR Megatron-LM Backend
+# apex
+pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
+         --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" \
+         git+https://github.com/NVIDIA/apex
+
+# transformer engine
+pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@v1.7
+
+# megatron core v0.4.0
+cd ..
+git clone -b core_v0.4.0 https://github.com/NVIDIA/Megatron-LM.git
+cd Megatron-LM
+cp ../verl/patches/megatron_v4.patch .
+git apply megatron_v4.patch
+pip3 install -e .
+export PYTHONPATH=$PYTHONPATH:$(pwd)
+```
+
+## Getting Started
+Visit our [documentation](https://verl-doc.readthedocs.io/en/latest/index.html) to learn more.
+
+**To run a PPO example, follow these steps:**
+- Preparation
+  - [Installation](https://verl-doc.readthedocs.io/en/latest/preparation/install.html)
+  - [Prepare Data (Parquet) for Post-Training](https://verl-doc.readthedocs.io/en/latest/preparation/prepare_data.html)
+  - [Implement Reward Function for Dataset](https://verl-doc.readthedocs.io/en/latest/preparation/reward_function.html)
+- PPO Example (Run an example)
+  - [PPO Example Architecture](https://verl-doc.readthedocs.io/en/latest/examples/ppo_code_architecture.html)
+  - [Config Explanation](https://verl-doc.readthedocs.io/en/latest/examples/config.html)
+  - [Run GSM8K Example](https://verl-doc.readthedocs.io/en/latest/examples/gsm8k_example.html)
+
+**For code explanation and advanced usage (extension):**
+- PPO Trainer and Workers
+  - [PPO Ray Trainer](https://verl-doc.readthedocs.io/en/latest/workers/ray_trainer.html)
+  - [PyTorch FSDP Backend](https://verl-doc.readthedocs.io/en/latest/workers/fsdp_workers.html)
+  - [Megatron-LM Backend](https://verl-doc.readthedocs.io/en/latest/index.html)
+- Advanced Usage and Extension
+  - [Ray API Design Tutorial](https://verl-doc.readthedocs.io/en/latest/advance/placement.html)
+  - [Extend to other RL(HF) algorithms](https://verl-doc.readthedocs.io/en/latest/advance/dpo_extension.html)
+  - [Add models to FSDP backend](https://verl-doc.readthedocs.io/en/latest/advance/fsdp_extension.html)
+  - [Add models to Megatron-LM backend](https://verl-doc.readthedocs.io/en/latest/advance/megatron_extension.html)
+
+
+## Contribution
+### Code formatting
+We use yapf (Google style) to enforce strict code formatting when reviewing MRs. To reformat your code locally, make sure you have installed `yapf`:
+```bash
+pip3 install yapf
+```
+Then, make sure you are at the top level of the verl repo and run:
+```bash
+yapf -ir -vv --style ./.style.yapf verl single_controller examples
+```
+
+
+
+## Citation
+
+```tex
+@article{sheng2024hybridflow,
+  title   = {HybridFlow: A Flexible and Efficient RLHF Framework},
+  author  = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},
+  year    = {2024},
+  journal = {arXiv preprint arXiv: 2409.19256}
+}
+
+@inproceedings{zhang2024framework,
+  title={A Framework for Training Large Language Models for Code Generation via Proximal Policy Optimization},
+  author={Zhang, Chi and Sheng, Guangming and Liu, Siyao and Li, Jiahao and Feng, Ziyuan and Liu, Zherui and Liu, Xin and Jia, Xiaoying and Peng, Yanghua and Lin, Haibin and Wu, Chuan},
+  booktitle={In NL2Code Workshop of ACM KDD},
+  year={2024}
+}
+```
+
--- a/docs/Makefile
+++ b/docs/Makefile
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = sphinx-build
+SPHINXPROJ    = verl
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
--- a/docs/README.md
+++ b/docs/README.md
+# veRL documents
+
+## Build the docs
+
+```bash
+# Install dependencies.
+pip install -r requirements-docs.txt
+
+# Build the docs.
+make clean
+make html
+```
+
+## Open the docs with your browser
+
+```bash
+python -m http.server -d _build/html/
+```
+Launch your browser and open localhost:8000.
\ No newline at end of file
--- a/docs/_static/logo.png
+++ b/docs/_static/logo.png
--- a/docs/advance/dpo_extension.rst
+++ b/docs/advance/dpo_extension.rst
+Extend to other RL(HF) algorithms
+=================================
+
+We have already implemented the complete training pipeline of the PPO
+algorithm. To extend to other algorithms, we analyze the high-level
+principles of using veRL and provide a tutorial on implementing the DPO
+algorithm. Users can follow a similar paradigm to extend to other RL algorithms.
+
+.. note:: **Key ideas**: Single process drives multi-process computation and data communication.
+
+Overall Approach
+----------------
+
+Step 1: Consider what multi-machine multi-GPU computations are needed
+for each model, such as ``generate_sequences``, ``compute_log_prob`` and
+``update_policy`` in the actor_rollout model. Implement distributed
+single-program-multiple-data (SPMD) computation and encapsulate them
+into APIs.
+
+Step 2: Based on different distributed scenarios, including FSDP and 3D
+parallelism in Megatron-LM, implement single-process control of data
+interaction among multi-process computations.
+
+Step 3: Utilize the encapsulated APIs to implement the control flow.
+
+Example: Online DPO
+-------------------
+
+We use veRL to implement a simple online DPO algorithm. The algorithm
+flow of Online DPO is as follows:
+
+1. There is a prompt (rollout) generator which has the same weights as
+   the actor model. After a batch of prompts is fed into the generator,
+   it generates N responses for each prompt.
+2. Send all the prompts + responses to a verifier for scoring, which can
+   be a reward model or a rule-based function. Then sort them in pairs to
+   form a training batch.
+3. Use this training batch to train the actor model using DPO. During
+   the process, a reference policy is needed.
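
Step 2 of this flow can be sketched in plain Python. This is a minimal illustration and not veRL's implementation: ``build_preference_pairs`` and the record layout are hypothetical, and the sketch assumes one simple pairing rule (the highest-scored response is "chosen", the lowest-scored is "rejected").

```python
from typing import Dict, List, Tuple


def build_preference_pairs(
    responses: Dict[str, List[Tuple[str, float]]],
) -> List[Dict[str, str]]:
    """Turn N scored responses per prompt into (chosen, rejected) pairs.

    Hypothetical helper: pairs the highest-scored response with the
    lowest-scored one and skips prompts that carry no preference signal.
    """
    pairs = []
    for prompt, scored in responses.items():
        if len(scored) < 2:
            continue  # a pair needs at least two responses
        ranked = sorted(scored, key=lambda item: item[1], reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best[1] == worst[1]:
            continue  # all responses tie, so there is nothing to prefer
        pairs.append({"prompt": prompt, "chosen": best[0], "rejected": worst[0]})
    return pairs


# Each prompt maps to a list of (response, score) tuples from the verifier.
scored_batch = {
    "2 + 2 = ?": [("4", 1.0), ("5", 0.0), ("four", 0.5)],
    "3 * 3 = ?": [("9", 1.0), ("9.0", 1.0)],
}
pairs = build_preference_pairs(scored_batch)
print(pairs)
```

Other pairing rules (for example, all ordered pairs with a score gap) fit the same interface; the choice affects how many training examples each prompt yields.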
+
+Step 1: What are the multi-machine multi-GPU computations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Sample Generator**
+
+Implementation details:
+
+.. code:: python
+
+   from single_controller.base import Worker
+   from single_controller.ray import RayWorkerGroup, RayClassWithInitArgs, RayResourcePool
+   import ray
+
+   @ray.remote
+   class SampleGenerator(Worker):
+       def __init__(self, config):
+           super().__init__()
+           self.config = config
+           
+       def generate_sequences(self, data):
+           pass
+
+Here, ``SampleGenerator`` can be viewed as a group of processes launched by
+``torchrun``, with each process running the same code (SPMD).
+``SampleGenerator`` needs to implement a ``generate_sequences`` API for
+the control flow to call. The implementation inside can use any
+inference engine, including vllm, sglang and huggingface. Users can
+largely reuse the code in
+verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we won’t
+go into details here.
+
+**ReferencePolicy inference**
+
+API: compute reference log probability
+
+.. code:: python
+
+   from single_controller.base import Worker
+   import ray
+
+   @ray.remote
+   class ReferencePolicy(Worker):
+       def __init__(self):
+           super().__init__()
+           self.model = Model()
+           
+       def infer(self, data):
+           return self.model(data)
+
+**Actor update**
+
+API: Update actor model parameters
+
+.. code:: python
+
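+   from single_controller.base import Worker
+   import ray
+   import torch.optim as optim
+   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+
+   @ray.remote
+   class DPOActor(Worker):
+       def __init__(self):
+           super().__init__()
+           self.model = Model()  # placeholder for the actor model
+           self.model = FSDP(self.model)  # or other distributed strategy
+           self.optimizer = optim.Adam(self.model.parameters(), lr=1e-3)
+           self.loss_fn = xxx  # placeholder for the DPO loss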
+           
+       def update(self, data):
+           self.optimizer.zero_grad()
+           logits = self.model(data)
+           loss = self.loss_fn(logits)
+           loss.backward()
+           self.optimizer.step()
+
+**Notes: How to distinguish between control processes and distributed computation processes**
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- Control processes are generally functions directly decorated with
+  ``@ray.remote``
+- Computation processes are all wrapped into a ``RayWorkerGroup``.
+
+Users can reuse most of the distributed computation logic implemented
+in the PPO algorithm, including the FSDP and Megatron-LM backends in
+verl/verl/trainer/ppo.
+
+Step 2: Based on different distributed scenarios, implement single-process control of multi-process data interaction
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**The core problem to solve here is how a single process sends data to
+multiple processes, drives multi-process computation, and how the
+control process obtains the results of multi-process computation.**
+First, we initialize the multi-process ``WorkerGroup`` in the control
+process.
+
+.. code:: python
+
+   @ray.remote(num_cpus=1)
+   def main_task(config):
+       # construct SampleGenerator
+       resource_pool = RayResourcePool(process_on_nodes=[8] * 2)  # 16 GPUs
+       ray_cls = RayClassWithInitArgs(SampleGenerator, config=config)
+       # put SampleGenerator onto resource pool
+       worker_group = RayWorkerGroup(resource_pool, ray_cls)
+       
+       # construct reference policy
+
+As we can see, in the control process, multiple processes are wrapped
+into a ``RayWorkerGroup``. Inside this ``WorkerGroup``, there is a
+``self._workers`` member, where each worker is a RayActor
+(https://docs.ray.io/en/latest/ray-core/actors.html) of SampleGenerator.
+ray_trainer.md also provides an implementation of
+``MegatronRayWorkerGroup``.
+
+Assuming the model is distributed using FSDP, and there is a batch of
+data on the control process, for data parallelism, the underlying
+calling process is:
+
+.. code:: python
+
+   data = xxx
+   data_list = data.chunk(dp_size)
+
+   output = []
+   for i, d in enumerate(data_list):
+       # worker_group._workers[i] is a SampleGenerator
+       output.append(worker_group._workers[i].generate_sequences.remote(d))
+
+   output = ray.get(output)
+   output = torch.cat(output)
+
+Single process calling multiple processes involves the following 3
+steps:
+
+1. Split the data into DP parts on the control process.
+2. Send the data to remote, call the remote computation through RPC, and
+   utilize multi-process computation.
+3. Obtain the computation results of each worker on the control process
+   and merge them.
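
These three steps can be tried out without a cluster. The sketch below is a stand-in built on assumptions: a ``ThreadPoolExecutor`` plays the role of Ray's remote calls, ``fake_generate`` is a hypothetical worker computation, and plain lists replace tensors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_generate(shard: List[int]) -> List[int]:
    """Hypothetical worker computation: double every element of its shard."""
    return [2 * x for x in shard]


def chunk(data: List[int], parts: int) -> List[List[int]]:
    """Split data into `parts` contiguous shards of near-equal size."""
    size, extra = divmod(len(data), parts)
    shards, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        shards.append(data[start:end])
        start = end
    return shards


dp_size = 4
data = list(range(10))

# 1. Split the data into DP parts on the control process.
shards = chunk(data, dp_size)

# 2. Send each shard to a worker and run the computation in parallel.
with ThreadPoolExecutor(max_workers=dp_size) as pool:
    futures = [pool.submit(fake_generate, shard) for shard in shards]
    # 3. Collect the per-worker results in submission order.
    results = [f.result() for f in futures]

# Merge the per-worker results back into one batch.
output = [x for part in results for x in part]
print(output)
```

Collecting futures in submission order keeps the merged output aligned with the input batch, which is what the ``torch.cat`` call relies on in the Ray version.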
+
+Frequently calling these 3 steps on the controller process greatly hurts
+code readability. **In veRL, we have abstracted and encapsulated these 3
+steps, so that the worker’s method + dispatch + collect can be
+registered into the worker_group**
+
+.. code:: python
+
+   from single_controller.base.decorator import register
+
+   def dispatch_data(worker_group, data):
+       return data.chunk(worker_group.world_size)
+       
+   def collect_data(worker_group, data):
+       return torch.cat(data)
+
+   dispatch_mode = {
+       'dispatch_fn': dispatch_data,
+       'collect_fn': collect_data
+   }
+
+   @register(dispatch_mode=dispatch_mode)
+   def generate_sequences(self, data):
+       pass
+
+In this way, we can directly call the method inside the worker through
+the ``worker_group`` on the control (driver) process (which is a single
+process):
+
+.. code:: python
+
+   output = worker_group.generate_sequences(data)
+
+This single line includes data splitting, data distribution and
+computation, and data collection.
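
To make the mechanism concrete, here is a minimal, framework-free sketch of how a registered method could be bound to a worker group. It is not veRL's implementation: ``register``, ``ToyWorker`` and ``ToyWorkerGroup`` below are hypothetical stand-ins that run workers in-process on Python lists rather than on Ray actors and tensors.

```python
from typing import Callable, Dict, List


def register(dispatch_mode: Dict[str, Callable]):
    """Attach dispatch/collect functions to a worker method (toy version)."""
    def decorator(method):
        method._dispatch_mode = dispatch_mode
        return method
    return decorator


def dispatch_data(worker_group, data: List[int]) -> List[List[int]]:
    # Contiguous split so that merged outputs keep the input order.
    n = worker_group.world_size
    size = (len(data) + n - 1) // n  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n)]


def collect_data(worker_group, outputs: List[List[int]]) -> List[int]:
    return [x for out in outputs for x in out]


class ToyWorker:
    @register(dispatch_mode={"dispatch_fn": dispatch_data, "collect_fn": collect_data})
    def generate_sequences(self, data: List[int]) -> List[int]:
        return [x + 1 for x in data]


class ToyWorkerGroup:
    """Exposes every registered worker method as a group-level call."""

    def __init__(self, worker_cls, world_size: int):
        self.world_size = world_size
        self._workers = [worker_cls() for _ in range(world_size)]
        for name, attr in vars(worker_cls).items():
            mode = getattr(attr, "_dispatch_mode", None)
            if mode is not None:
                setattr(self, name, self._bind(name, mode))

    def _bind(self, name: str, mode: Dict[str, Callable]):
        def call(data):
            shards = mode["dispatch_fn"](self, data)
            outputs = [getattr(w, name)(s) for w, s in zip(self._workers, shards)]
            return mode["collect_fn"](self, outputs)
        return call


worker_group = ToyWorkerGroup(ToyWorker, world_size=2)
output = worker_group.generate_sequences([0, 1, 2, 3])
print(output)
```

The decorator only records metadata; the worker group does the splitting, calling and merging, so the worker method itself stays a plain per-shard function.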
+
+Furthermore, the model parallelism size of each model is usually fixed,
+including dp, tp, pp. So for these common distributed scenarios, we have
+pre-implemented specific dispatch and collect methods in `decorator.py <https://github.com/volcengine/verl/blob/main/single_controller/base/decorator.py>`_, which can be directly used to wrap the computations.
+
+.. code:: python
+
+   from single_controller.base.decorator import register, Dispatch
+
+   @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
+   def generate_sequences(self, data: DataProto) -> DataProto:
+       pass
+
+Here it requires the data interface to be ``DataProto``. Definition of
+``DataProto`` is in `protocol.py <https://github.com/volcengine/verl/blob/main/verl/protocol.py>`_.
+
+Step 3: Main training loop
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+With the above training flows, we can implement the algorithm’s control
+flow. It is recommended that ``main_task`` is also a ray remote process.
+
+.. code:: python
+
+   @ray.remote(num_cpus=1)
+   def main_task(config):
+       # construct SampleGenerator
+       resource_pool = RayResourcePool(process_on_nodes=[8] * 2)  # 16 GPUs
+       ray_cls = RayClassWithInitArgs(SampleGenerator, config=config) 
+       # put SampleGenerator onto resource pool
+       sample_gen = RayWorkerGroup(resource_pool, ray_cls)
+       
+       # construct reference policy
+       ray_cls = RayClassWithInitArgs(ReferencePolicy)
+       ref_policy = RayWorkerGroup(resource_pool, ray_cls)
+       
+       # construct actor
+       ray_cls = RayClassWithInitArgs(DPOActor)  
+       dpo_policy = RayWorkerGroup(resource_pool, ray_cls)
+       
+       dataloader = DataLoader()
+       
+       for data in dataloader:
+           # generate data
+           data = sample_gen.generate_sequences(data)
+           # generate scores for each data 
+           data = generate_scores(data)
+           # generate pairwise data using scores
+           data = generate_pairwise_data(data)
+           # generate ref_log_prob
+           data.batch['ref_log_prob'] = ref_policy.infer(data)
+           # update using dpo
+           dpo_policy.update(data)
+           # logging
+
+Here, different ``WorkerGroups`` can be placed in the same resource pool or
+in different resource pools using ``create_colocated_worker_cls``,
+similar to `ray_trainer.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py>`_.
--- a/docs/advance/fsdp_extension.rst
+++ b/docs/advance/fsdp_extension.rst
+
+Add models to FSDP backend
+===========================
+
+Model
+--------------------------
+
+In principle, our FSDP backend can support any HF model and we can
+synchronize the actor model weights with vLLM using `hf_weight_loader.py <https://github.com/volcengine/verl/blob/main/verl/third_party/vllm/vllm_v_0_5_4/hf_weight_loader.py>`_.
+However, ``hf_weight_loader`` gathers the full state_dict of a
+model during synchronization, which may cause OOM. We suggest using
+``dtensor_weight_loader``, which gathers the full model parameters layer by
+layer to reduce the peak memory usage. We already support the dtensor weight
+loader for the models below in `dtensor_weight_loader.py <https://github.com/volcengine/verl/blob/main/verl/third_party/vllm/vllm_v_0_5_4/dtensor_weight_loader.py>`_:
+
+- ``GPT2LMHeadModel``
+- ``LlamaForCausalLM``
+- ``LLaMAForCausalLM``
+- ``MistralForCausalLM``
+- ``InternLMForCausalLM``
+- ``AquilaModel``
+- ``AquilaForCausalLM``
+- ``Phi3ForCausalLM``
+- ``GemmaForCausalLM``
+- ``Gemma2ForCausalLM``
+- ``GPTBigCodeForCausalLM``
+- ``Starcoder2ForCausalLM``
+- ``Qwen2ForCausalLM``
+- ``DeepseekV2ForCausalLM``
+
+To implement ``dtensor_weight_loader`` for a model that’s supported in
+vLLM, follow the guide for the gemma model below:
+
+1. Copy the
+   ``load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]])`` from the vllm model class
+   to ``dtensor_weight_loader.py``
+2. Modify the arguments to
+   ``(actor_weights: Dict, vllm_model: nn.Module)``
+3. Replace ``self`` with ``vllm_model``
+4. Add
+   ``local_loaded_weight = redistribute_dtensor(param_name=name, loaded_weights=loaded_weight)``
+   before each ``param = params_dict[name]`` and modify the subsequent
+   weight loading to use ``local_loaded_weight``.
+5. Register the implemented dtensor weight loader to ``__MODEL_DTENSOR_WEIGHT_LOADER_REGISTRY__``.
+
+.. code-block:: diff
+
+    - def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
+    + def gemma_dtensor_weight_loader(actor_weights: Dict, vllm_model: nn.Module) -> nn.Module:
+        stacked_params_mapping = [
+            # (param_name, shard_name, shard_id)
+            ("qkv_proj", "q_proj", "q"),
+            ("qkv_proj", "k_proj", "k"),
+            ("qkv_proj", "v_proj", "v"),
+            ("gate_up_proj", "gate_proj", 0),
+            ("gate_up_proj", "up_proj", 1),
+        ]
+    -   params_dict = dict(self.named_parameters())
+    +   params_dict = dict(vllm_model.named_parameters())
+        loaded_params = set()
+    -   for name, loaded_weight in weights:
+    +   for name, loaded_weight in actor_weights.items():
+            for (param_name, shard_name, shard_id) in stacked_params_mapping:
+                if shard_name not in name:
+                    continue
+                name = name.replace(shard_name, param_name)
+                # Skip loading extra bias for GPTQ models.
+                if name.endswith(".bias") and name not in params_dict:
+                    continue
+    +           local_loaded_weight = redistribute_dtensor(param_name=name, loaded_weights=loaded_weight)
+                param = params_dict[name]
+                weight_loader = param.weight_loader
+    -           weight_loader(param, loaded_weight, shard_id)
+    +           weight_loader(param, local_loaded_weight.to(dtype=param.dtype), shard_id)
+                break
+            else:
+                # lm_head is not used in vllm as it is tied with embed_token.
+                # To prevent errors, skip loading lm_head.weight.
+                if "lm_head.weight" in name:
+                    continue
+                # Skip loading extra bias for GPTQ models.
+                if name.endswith(".bias") and name not in params_dict:
+                    continue
+    +           local_loaded_weight = redistribute_dtensor(param_name=name, loaded_weights=loaded_weight)
+                param = params_dict[name]
+                weight_loader = getattr(param, "weight_loader",
+                                        default_weight_loader)
+    -           weight_loader(param, loaded_weight)
+    +           weight_loader(param, local_loaded_weight.to(dtype=param.dtype))
+            loaded_params.add(name)
+        unloaded_params = params_dict.keys() - loaded_params
+        if unloaded_params:
+            raise RuntimeError(
+                "Some weights are not initialized from checkpoints: "
+                f"{unloaded_params}")
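
Step 5 can be pictured as a plain dictionary lookup keyed by architecture name. The sketch below is an assumption, not code from ``dtensor_weight_loader.py``: only the registry name comes from the guide above, while the placeholder loader body and the ``get_dtensor_weight_loader`` helper are hypothetical.

```python
from typing import Callable, Dict


def gemma_dtensor_weight_loader(actor_weights: Dict, vllm_model):
    """Placeholder body; a real loader follows the diff above."""
    return vllm_model


# Registry name taken from the guide; the contents here are illustrative.
__MODEL_DTENSOR_WEIGHT_LOADER_REGISTRY__: Dict[str, Callable] = {
    "GemmaForCausalLM": gemma_dtensor_weight_loader,
}


def get_dtensor_weight_loader(arch: str) -> Callable:
    """Hypothetical lookup helper that fails loudly for unsupported models."""
    try:
        return __MODEL_DTENSOR_WEIGHT_LOADER_REGISTRY__[arch]
    except KeyError:
        supported = sorted(__MODEL_DTENSOR_WEIGHT_LOADER_REGISTRY__)
        raise ValueError(
            f"No dtensor weight loader for {arch!r}; supported: {supported}"
        ) from None


loader = get_dtensor_weight_loader("GemmaForCausalLM")
print(loader.__name__)
```

Keying the registry by the architecture string means a new model only needs one extra dictionary entry once its loader function exists.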
\ No newline at end of file
--- a/docs/advance/megatron_extension.rst
+++ b/docs/advance/megatron_extension.rst
+Add models to Megatron-LM backend
+=================================
+
+Model
+-----------
+
+The most challenging aspect of using the Megatron-LM backend is implementing
+the models for training. Currently, we implement a Llama model that
+supports data parallelism, tensor parallelism, pipeline parallelism (also
+vPP) and sequence parallelism. We also implement remove padding for the Llama
+model, which can be found in `modeling_llama_megatron.py <https://github.com/volcengine/verl/blob/main/verl/models/llama/megatron/modeling_llama_megatron.py>`_.
+
+To support other models, users are required to implement:
+
+1. A model similar to ``modeling_llama_megatron.py`` that satisfies the
+   parallelism requirements of Megatron-LM. Then register your model in
+   the `registry.py <https://github.com/volcengine/verl/blob/main/verl/models/registry.py>`_.
+2. Checkpoint utils that can load a full checkpoint (e.g. huggingface
+   checkpoint) to partitioned models during runtime. Then register
+   your loader to ``weight_loader_registry`` in `weight_loader_registry.py <https://github.com/volcengine/verl/blob/main/verl/models/weight_loader_registry.py>`_.
+3. A weight loader that synchronizes the weights from Megatron to the rollout
+   (vLLM) model. Note that both the actor model and rollout model are
+   partitioned during runtime. So, it’s advisable to match the parameter
+   names of the rollout model in the actor model implementation. Otherwise,
+   you may need an additional name mapping and even weight transformation.
\ No newline at end of file
--- a/docs/advance/placement.rst
+++ b/docs/advance/placement.rst
+Ray API Design Tutorial
+=======================================
+
+We provide a tutorial for our Ray API design, including:
+
+- Ray basic concepts
+- Resource Pool and RayWorkerGroup
+- Data Dispatch, Execution and Collection
+- Initialize the RayWorkerGroup and execute the distributed computation in the given Resource Pool
+
+See details in `tutorial.ipynb <https://github.com/volcengine/verl/blob/main/examples/ray/tutorial.ipynb>`_.
\ No newline at end of file
--- a/docs/conf.py
+++ b/docs/conf.py
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+# import os
+# import sys
+# sys.path.insert(0, os.path.abspath('.'))
+
+
+# -- Project information -----------------------------------------------------
+
+project = u'veRL'
+# pylint: disable=W0622
+copyright = u'2024 ByteDance Seed Foundation MLSys Team'
+author = u'Guangming Sheng, Chi Zhang, Yanghua Peng, Haibin Lin'
+
+
+# -- General configuration ---------------------------------------------------
+# The master toctree document.
+master_doc = 'index'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = ['recommonmark',
+  'sphinx.ext.autosectionlabel',
+]
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+source_suffix = ['.rst', '.rest', '.md']
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = u'en'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'sphinx_rtd_theme'
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
\ No newline at end of file
--- a/docs/examples/config.rst
+++ b/docs/examples/config.rst
--- a/docs/examples/gsm8k_example.rst
+++ b/docs/examples/gsm8k_example.rst
+GSM8K Example
+=============
+
+Introduction
+------------
+
+In this example, we train an LLM to tackle the GSM8k task.
+
+Paper: https://arxiv.org/pdf/2110.14168
+
+Dataset: https://huggingface.co/datasets/gsm8k
+
+Note that the original paper mainly focuses on training a verifier (a
+reward model) to solve math problems via Best-of-N sampling. In this
+example, we train an RLHF agent using a rule-based reward model.
+
+Dataset Introduction
+--------------------
+
+GSM8k is a math problem dataset. The prompt is an elementary school
+problem. The LLM model is required to answer the math problem.
+
+The training set contains 7473 samples and the test set contains 1319
+samples.
+
+**An example**
+
+Prompt
+
+   Katy makes coffee using teaspoons of sugar and cups of water in the
+   ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups
+   of water, calculate the number of teaspoonfuls of sugar she used.
+
+Solution
+
+   The total ratio representing the ingredients she used to make the
+   coffee is 7+13 = <<7+13=20>>20 Since the fraction representing the
+   number of teaspoons she used is 7/20, she used 7/20*120 =
+   <<7/20*120=42>>42 #### 42
+
+Step 1: Prepare dataset
+-----------------------
+
+.. code:: bash
+
+   cd examples/data_preprocess
+   python3 gsm8k.py --local_dir ~/data/gsm8k
+
+Step 2: Download Model
+----------------------
+
+There are three ways to prepare the model checkpoints for post-training:
+
+- Download the required models from Hugging Face
+
+.. code:: bash
+
+   huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct --local-dir-use-symlinks False
+
+- Use a model already stored in a local directory or HDFS path.
+- Alternatively, use the model name on Hugging Face directly (e.g.,
+  deepseek-ai/deepseek-math-7b-instruct) in the
+  ``actor_rollout_ref.model.path`` and ``critic.model.path`` fields in
+  the run script.
+
+Note that users should prepare checkpoints for the actor, critic and reward
+model.
+
+[Optional] Step 3: SFT your Model
+---------------------------------
+
+We provide an SFT trainer built on PyTorch FSDP in
+`fsdp_sft_trainer.py <https://github.com/volcengine/verl/blob/main/verl/trainer/fsdp_sft_trainer.py>`_.
+Users can build their own SFT script on top of this FSDP SFT trainer.
+
+We also provide several SFT training scripts for the GSM8K dataset in the `gsm8k sft directory <https://github.com/volcengine/verl/blob/main/examples/gsm8k/sft/>`_.
+
+.. code:: shell
+
+   set -x
+
+   torchrun -m verl.trainer.fsdp_sft_trainer \
+       data.train_files=$HOME/data/gsm8k/train.parquet \
+       data.val_files=$HOME/data/gsm8k/test.parquet \
+       data.prompt_key=question \
+       data.response_key=answer \
+       data.micro_batch_size=8 \
+       model.partial_pretrain=deepseek-ai/deepseek-coder-6.7b-instruct \
+       trainer.default_hdfs_dir=hdfs://user/verl/experiments/gsm8k/deepseek-coder-6.7b-instruct/ \
+       trainer.project_name=gsm8k-sft \
+       trainer.experiment_name=gsm8k-sft-deepseek-coder-6.7b-instruct \
+       trainer.total_epochs=4 \
+       trainer.logger=['console','tracking']
+
+Step 4: Perform PPO training with your model on GSM8K Dataset
+-------------------------------------------------------------
+
+- Prepare your own run.sh script. Below is an example for the GSM8k dataset
+  and the deepseek-llm-7b-chat model.
+- Replace ``data.train_files``, ``data.val_files``,
+  ``actor_rollout_ref.model.path`` and ``critic.model.path`` to match
+  your environment.
+- See :doc:`config` for a detailed explanation of each config field.
+
+**Reward Model/Function**
+
+We use a rule-based reward model. The model is instructed to put its final
+answer after four ``#`` characters (``####``), as in the solution above. We
+extract the final answer from both the reference solution and the model’s
+output with a regular expression, then compare the two. A correct answer
+receives a reward of 1, an incorrect answer 0.1, and a missing answer 0.
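+
+The scoring rule described above can be sketched in a few lines. This is
+an illustration, not the code in ``verl/utils/reward_score/gsm8k.py``: the
+function names and the choice to score the last ``####`` match are
+assumptions made for the example.
+
```python
import re

# Same pattern as the dataset preprocessing: a number following "#### ".
ANSWER_PATTERN = re.compile(r"#### (\-?[0-9\.\,]+)")


def extract_answer(text):
    """Return the last '#### <number>' answer in text, or None if absent."""
    matches = ANSWER_PATTERN.findall(text)
    if not matches:
        return None
    # Assumption: when several matches exist, the last one is the final answer.
    return matches[-1].replace(",", "")


def rule_based_reward(response, ground_truth):
    """1.0 for a correct answer, 0.1 for a wrong one, 0.0 for no answer."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth else 0.1


if __name__ == "__main__":
    print(rule_based_reward("7/20 * 120 = 42 #### 42", "42"))
    print(rule_based_reward("I think it is 41 #### 41", "42"))
    print(rule_based_reward("I am not sure", "42"))
```
+
+The comparison is a plain string match, so the ground truth stored in the
+dataset must be normalized the same way as the extracted answer.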
+
+**Training Script**
+
+Example training scripts for the FSDP and Megatron-LM backends are stored in the ``examples/ppo_trainer`` directory.
+
+.. code:: bash
+
+   cd ../ppo_trainer
+   bash run_deepseek7b_llm.sh
+
+The contents of ``run_deepseek7b_llm.sh``:
+
+.. code:: bash
+
+   set -x
+
+   python3 -m verl.trainer.main_ppo \
+       data.train_files=~/data/rlhf/gsm8k/train.parquet \
+       data.val_files=~/data/rlhf/gsm8k/test.parquet \
+       data.train_batch_size=1024 \
+       data.val_batch_size=1312 \
+       data.max_prompt_length=512 \
+       data.max_response_length=512 \
+       actor_rollout_ref.model.path=~/models/deepseek-llm-7b-chat \
+       actor_rollout_ref.actor.optim.lr=1e-6 \
+       actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+       actor_rollout_ref.actor.ppo_micro_batch_size=64 \
+       actor_rollout_ref.actor.fsdp_config.param_offload=False \
+       actor_rollout_ref.actor.fsdp_config.grad_offload=False \
+       actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+       actor_rollout_ref.rollout.micro_batch_size=256 \
+       actor_rollout_ref.rollout.log_prob_micro_batch_size=128 \
+       actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
+       actor_rollout_ref.rollout.name=vllm \
+       actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+       actor_rollout_ref.ref.log_prob_micro_batch_size=128 \
+       actor_rollout_ref.ref.fsdp_config.param_offload=True \
+       critic.optim.lr=1e-5 \
+       critic.model.path=~/models/deepseek-llm-7b-chat \
+       critic.model.enable_gradient_checkpointing=False \
+       critic.ppo_micro_batch_size=64 \
+       critic.model.fsdp_config.param_offload=False \
+       critic.model.fsdp_config.grad_offload=False \
+       critic.model.fsdp_config.optimizer_offload=False \
+       algorithm.kl_ctrl.kl_coef=0.001 \
+       trainer.critic_warmup=0 \
+       trainer.logger=['console','tracking'] \
+       trainer.project_name='verl_example_gsm8k' \
+       trainer.experiment_name='deepseek_llm_7b_function_rm' \
+       trainer.n_gpus_per_node=8 \
+       trainer.nnodes=1 \
+       trainer.save_freq=-1 \
+       trainer.total_epochs=15
--- a/docs/examples/ppo_code_architecture.rst
+++ b/docs/examples/ppo_code_architecture.rst
+PPO Example Architecture
+========================
+
+Let’s start with the Proximal Policy Optimization (PPO) algorithm, one of
+the most widely used algorithms in LLM post-training.
+
+The main entry point of the PPO algorithm example is:
+`main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py>`_.
+In this tutorial, we will go through the code architecture in `main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py>`_.
+
+Define the data
+---------------
+
+Users need to preprocess the dataset and store it in parquet files.
+We implement ``RLHFDataset`` to load and tokenize the parquet files.
+
+For ``RLHFDataset`` (the default), at least one field is required:
+
+- ``prompt``: Contains the string prompt
+
+We already provide examples that convert datasets to parquet
+files in the `data_preprocess directory <https://github.com/volcengine/verl/blob/main/examples/data_preprocess>`_. Currently, we support
+preprocessing of the GSM8k, MATH, HellaSwag and Full_hh_rlhf datasets. See :doc:`../preparation/prepare_data` for
+more information.
+
+Define the reward functions for different datasets
+--------------------------------------------------
+
+In this main entry point, the users only need to define their own reward
+function based on the datasets (or applications) utilized in PPO
+training.
+
+For example, we already provide reward functions for the `GSM8k <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py>`_
+and `MATH <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/math.py>`_
+datasets in ``_select_rm_score_fn``. The ``RewardManager`` uses the
+``data_source`` of each sample to select the corresponding reward
+function and compute the reward score. For some RLHF datasets (e.g.,
+full_hh_rlhf), a reward model assesses the responses and no reward
+function is involved. In this case, the ``RewardManager``
+returns the ``rm_score`` computed by the reward model directly.
+
+See `reward functions <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_ for detailed implementation.
+
+Define worker classes
+---------------------
+
+.. code:: python
+
+   if config.actor_rollout_ref.actor.strategy == 'fsdp': # for FSDP backend
+       assert config.actor_rollout_ref.actor.strategy == config.critic.strategy
+       from verl.trainer.ppo.workers.fsdp_workers import ActorRolloutRefWorker, CriticWorker
+       from single_controller.ray import RayWorkerGroup
+       ray_worker_group_cls = RayWorkerGroup
+
+   elif config.actor_rollout_ref.actor.strategy == 'megatron': # for Megatron backend
+       assert config.actor_rollout_ref.actor.strategy == config.critic.strategy
+       from verl.trainer.ppo.workers.megatron_workers import ActorRolloutRefWorker, CriticWorker
+       from single_controller.ray.megatron import NVMegatronRayWorkerGroup
+       ray_worker_group_cls = NVMegatronRayWorkerGroup # Ray worker class for Megatron-LM
+
+   else:
+       raise NotImplementedError
+
+   from verl.trainer.ppo.ray_trainer import ResourcePoolManager, Role
+
+   role_worker_mapping = {
+       Role.ActorRollout: ActorRolloutRefWorker,
+       Role.Critic: CriticWorker,
+       Role.RefPolicy: ActorRolloutRefWorker
+   }
+
+   global_pool_id = 'global_pool'
+   resource_pool_spec = {
+       global_pool_id: [config.trainer.n_gpus_per_node] * config.trainer.nnodes,
+   }
+   mapping = {
+       Role.ActorRollout: global_pool_id,
+       Role.Critic: global_pool_id,
+       Role.RefPolicy: global_pool_id,
+   }
+
+Step 1: Construct the mapping between roles and workers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A role represents a group of workers in the same process. We have
+pre-defined several roles in `ray_trainer.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py#L38>`_.
+
+.. code:: python
+
+   from enum import Enum
+
+   class Role(Enum):
+       """
+       To create more roles dynamically, you can subclass Role and add new members
+       """
+       Actor = 0  # This worker only has Actor
+       Rollout = 1 # This worker only has Rollout
+       ActorRollout = 2 # This worker has both actor and rollout, it's a HybridEngine
+       Critic = 3 # This worker only has critic
+       RefPolicy = 4 # This worker only has reference policy
+       RewardModel = 5 # This worker only has reward model
+       ActorRolloutRef = 6 # This worker contains actor, rollout and reference policy simultaneously 
+
+Step 2: Define the worker class corresponding to this role
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- We have pre-implemented the ``ActorRolloutRefWorker``. Depending on
+  the config, it can act as a standalone actor, a standalone rollout,
+  an ActorRollout HybridEngine, or an ActorRolloutRef HybridEngine.
+- We have also pre-implemented workers for ``Actor``, ``Rollout``,
+  ``Critic``, ``Reward Model`` and ``Reference model`` on two
+  backends: PyTorch FSDP and Megatron-LM.
+  See `FSDP Workers <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/fsdp_workers.py>`_
+  and `Megatron-LM Workers <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/megatron_workers.py>`_
+  for more information.
+
+Step 3: Define resource pool id and resource pool spec
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- A resource pool is a division of the global GPU resources.
+  ``resource_pool_spec`` is a dict that maps each pool id to the number
+  of GPUs it uses on each node.
+
+  - In the above example, we define a single global resource pool,
+    ``global_pool_id``, and place all roles on it, so every role uses
+    all the GPUs in this post-training task. This is the
+    *co-locate* placement, where all the models share the same set of
+    GPUs.
+
+- See resource pool and placement for advanced usage.
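+
+To make the relation between ``resource_pool_spec`` and ``mapping``
+concrete, the sketch below counts the GPUs each role ends up with. It uses
+plain strings instead of the ``Role`` enum so that it runs on its own; the
+helper names and the split placement are hypothetical and only illustrate
+the data layout.
+
```python
def total_gpus(resource_pool_spec):
    """Sum the per-node GPU counts of every pool."""
    return {pool_id: sum(per_node) for pool_id, per_node in resource_pool_spec.items()}


def gpus_per_role(resource_pool_spec, mapping):
    """Look up how many GPUs the pool assigned to each role provides."""
    totals = total_gpus(resource_pool_spec)
    return {role: totals[pool_id] for role, pool_id in mapping.items()}


# Co-locate placement, as in the example above: one pool shared by all roles.
n_gpus_per_node, nnodes = 8, 1
colocated_spec = {"global_pool": [n_gpus_per_node] * nnodes}
colocated_mapping = {
    "ActorRollout": "global_pool",
    "Critic": "global_pool",
    "RefPolicy": "global_pool",
}

# Hypothetical split placement: the critic gets its own four GPUs.
split_spec = {"actor_pool": [4], "critic_pool": [4]}
split_mapping = {
    "ActorRollout": "actor_pool",
    "RefPolicy": "actor_pool",
    "Critic": "critic_pool",
}

if __name__ == "__main__":
    print(gpus_per_role(colocated_spec, colocated_mapping))
    print(gpus_per_role(split_spec, split_mapping))
```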
+
+Defining reward model/function
+------------------------------
+
+.. code:: python
+
+   # we should adopt a multi-source reward function here
+   # - for rule-based rm, we directly call a reward score
+   # - for model-based rm, we call a model
+   # - for code related prompt, we send to a sandbox if there are test cases
+   # - finally, we combine all the rewards together
+   # - The reward type depends on the tag of the data
+   if config.reward_model.enable:
+       from verl.trainer.ppo.workers.fsdp_workers import RewardModelWorker
+       role_worker_mapping[Role.RewardModel] = RewardModelWorker
+       mapping[Role.RewardModel] = global_pool_id
+    
+   reward_fn = RewardManager(tokenizer=tokenizer, num_examine=0)
+
+   # Note that we always use function-based RM for validation
+   val_reward_fn = RewardManager(tokenizer=tokenizer, num_examine=1)
+
+   resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping)
+
+Since not all tasks use a model-based RM, users need to specify here
+whether the RM is model-based or function-based.
+
+- For a model-based RM, add the ``RewardModel`` role to
+  ``role_worker_mapping`` and to the resource pool ``mapping``.
+
+  - Note that the pre-defined ``RewardModelWorker`` only supports models
+    with the structure of Hugging Face
+    ``AutoModelForSequenceClassification``. For any other model structure, you
+    need to define your own RewardModelWorker in `FSDP Workers <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/fsdp_workers.py>`_
+    and `Megatron-LM Workers <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/megatron_workers.py>`_.
+
+- For a function-based RM, users are required to select the
+  reward function for each dataset.
+
+.. code:: python
+
+   def _select_rm_score_fn(data_source):
+       if data_source == 'openai/gsm8k':
+           return gsm8k.compute_score
+       elif data_source == 'lighteval/MATH':
+           return math.compute_score
+       else:
+           raise NotImplementedError
+
+See reward functions implemented in `directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/>`_ 
+for more information.
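+
+The same selection can also be written as a lookup table, which keeps the
+dataset names in one place when more datasets are added. The sketch below
+is self-contained and uses a toy exact-match scorer in place of the real
+``gsm8k.compute_score`` and ``math.compute_score``; the registry and the
+sample layout are assumptions made for illustration.
+
```python
def exact_match_score(response, ground_truth):
    """Toy reward function: 1.0 if the stripped strings match, else 0.0."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0


# Hypothetical registry keyed by data_source; real entries would point to
# dataset-specific scoring functions.
SCORE_FNS = {
    "openai/gsm8k": exact_match_score,
    "lighteval/MATH": exact_match_score,
}


def select_score_fn(data_source):
    """Return the reward function registered for data_source."""
    try:
        return SCORE_FNS[data_source]
    except KeyError:
        raise NotImplementedError(
            f"no reward function registered for {data_source!r}") from None


def score_batch(samples):
    """Score each sample with the function chosen by its data_source."""
    return [
        select_score_fn(s["data_source"])(s["response"], s["ground_truth"])
        for s in samples
    ]
```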
+
+Define, init and run the PPO Trainer
+------------------------------------
+
+.. code:: python
+
+   trainer = RayPPOTrainer(config=config,
+                           tokenizer=tokenizer,
+                           role_worker_mapping=role_worker_mapping,
+                           resource_pool_manager=resource_pool_manager,
+                           ray_worker_group_cls=ray_worker_group_cls,
+                           reward_fn=reward_fn,
+                           val_reward_fn=val_reward_fn)
+   trainer.init_workers()
+   trainer.fit()
+
+- We first initialize the ``RayPPOTrainer`` with the user config, the tokenizer
+  and the worker mapping, resource pool, worker group and
+  reward functions defined above.
+- We then call ``trainer.init_workers()`` to initialize the models
+  on the allocated GPUs (in the resource pool).
+- The actual PPO training is executed in ``trainer.fit()``.
+
+veRL can be easily extended to other RL algorithms by reusing the Ray
+model workers, resource pool and reward functions. See :doc:`extension<../advance/dpo_extension>` for
+more information.
+
+Details of the ``RayPPOTrainer`` are discussed in :doc:`Ray Trainer<../workers/ray_trainer>`.
--- a/docs/index.rst
+++ b/docs/index.rst
+Welcome to veRL/HybridFlow's documentation!
+================================================
+
+veRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for post-training of large language models (LLMs).
+
+veRL is flexible and easy to use with:
+
+- **Easy to support diverse RL(HF) algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows, allowing users to build RL dataflows in a few lines of code.
+
+- **Seamless integration of existing LLM infra with modular API design**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.
+
+- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.
+
+- Ready integration with popular Hugging Face models.
+
+
+veRL is fast with:
+
+- **State-of-the-art throughput**: By seamlessly integrating existing SOTA LLM training and inference frameworks, veRL achieves high generation and training throughput.
+
+- **Efficient actor model resharding with 3D-HybridEngine**: Eliminates memory redundancy and significantly reduces communication overhead during transitions between training and generation phases.
+
+--------------------------------------------
+
+.. _Contents:
+
+.. toctree::
+   :maxdepth: 5
+   :caption: Preparation
+   :titlesonly:
+   :numbered:
+
+   preparation/install
+   preparation/prepare_data
+   preparation/reward_function
+
+.. toctree::
+   :maxdepth: 2
+   :caption: PPO Example
+   :titlesonly:
+   :numbered:
+
+   examples/ppo_code_architecture
+   examples/config
+   examples/gsm8k_example
+
+.. toctree:: 
+   :maxdepth: 1
+   :caption: PPO Trainer and Workers
+
+   workers/ray_trainer
+   workers/fsdp_workers
+   workers/megatron_workers
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Advanced Usage and Extension
+
+   advance/placement
+   advance/dpo_extension
+   advance/fsdp_extension
+   advance/megatron_extension
+
+
+Contribution
+-------------
+
+veRL is free software; you can redistribute it and/or modify it under the terms
+of the Apache License 2.0. We welcome contributions.
+Join us on `GitHub <https://github.com/volcengine/verl>`_ .
+
+.. and check out our
+.. :doc:`contribution guidelines <contribute>`.
--- a/docs/preparation/install.rst
+++ b/docs/preparation/install.rst
+Installation
+============
+
+To install veRL, we recommend using conda:
+
+.. code:: bash
+
+   conda create -n verl python==3.9
+   conda activate verl
+
+To get the latest version of veRL, clone the repository and
+install it from source. You can then modify the code to customize your
+own post-training jobs.
+
+.. code:: bash
+
+   # install verl together with some lightweight dependencies in setup.py
+   git clone https://github.com/volcengine/verl.git
+   cd verl
+   pip3 install -e .
+
+You can also install veRL from PyPI using ``pip3 install``:
+
+.. code:: bash
+
+   # directly install from pypi
+   pip3 install verl
+
+Dependencies
+------------
+
+veRL requires Python >= 3.9 and CUDA >= 12.1.
+
+veRL supports multiple backends. We currently release FSDP and Megatron-LM
+for actor training and vLLM for rollout generation.
+
+The following dependencies are required for both training backends, PyTorch FSDP and Megatron-LM.
+
+The pros, cons and extension guide for the PyTorch FSDP backend can be
+found in :doc:`FSDP Workers<../workers/fsdp_workers>`.
+
+.. code:: bash
+
+   # install torch [or skip this step and let vllm install the correct version for you]
+   pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
+
+   # install vllm
+   pip3 install vllm==0.5.4
+   pip3 install ray==2.10 # other versions may have bugs
+
+   # flash attention 2
+   pip3 install flash-attn --no-build-isolation
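+
+After installation, a short script can confirm that the interpreter and
+the pinned packages are in place. The sketch below uses only the standard
+library; the helper names are hypothetical, and the distribution names
+passed to it are assumed to match the ``pip`` names used above.
+
```python
import sys
from importlib import metadata


def python_is_supported(version_info=None, minimum=(3, 9)):
    """Check the (major, minor) interpreter version against the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


def installed_versions(packages):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python ok:", python_is_supported())
    for name, version in installed_versions(["torch", "vllm", "ray", "flash-attn"]).items():
        print(f"{name}: {version or 'not installed'}")
```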
+
+For users who need better scalability, we recommend the Megatron-LM
+backend. Please install the above dependencies first.
+
+We currently support Megatron-LM\@core_v0.4.0, with a patch that fixes some
+internal issues of Megatron-LM. The additional installation steps are listed below.
+
+The pros, cons and extension guide for using Megatron-LM backend can be
+found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
+
+.. code:: bash
+
+   # FOR Megatron-LM Backend
+   # apex
+   pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
+            --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" \
+            git+https://github.com/NVIDIA/apex
+
+   # transformer engine
+   pip3 install git+https://github.com/NVIDIA/TransformerEngine.git@v1.7
+
+   # megatron core v0.4.0
+   cd ..
+   git clone -b core_v0.4.0 https://github.com/NVIDIA/Megatron-LM.git
+   cd Megatron-LM
+   cp ../verl/patches/megatron_v4.patch .
+   git apply megatron_v4.patch
+   pip3 install -e .
+   export PYTHONPATH=$PYTHONPATH:$(pwd)
--- a/docs/preparation/prepare_data.rst
+++ b/docs/preparation/prepare_data.rst
+Prepare Data (Parquet) for Post-Training
+========================================
+
+Before starting the post-training job, we need to prepare the data for
+policy training. The data should be stored in the parquet format.
+
+We provide data preprocessing scripts for several datasets,
+including GSM8K, MATH, HellaSwag and Full_hh_rlhf. To prepare another dataset,
+write a similar script. A data preprocessing script can be divided
+into two parts:
+
+1. The first part is common to all datasets. It loads the dataset from
+   Hugging Face’s ``datasets`` package, preprocesses it with
+   ``make_map_fn`` and stores the result in the parquet format.
+
+.. code:: python
+
+   import re
+   import os
+   import datasets
+
+   from verl.utils.hdfs_io import copy, makedirs
+   import argparse
+
+   # To extract the solution for each prompts in the dataset
+   # def extract_solution(solution_str): 
+   # ...
+
+
+   if __name__ == '__main__':
+       parser = argparse.ArgumentParser()
+       parser.add_argument('--local_dir', default='/opt/tiger/gsm8k')
+       parser.add_argument('--hdfs_dir', default='hdfs://haruna/home/byte_data_seed/lf_lq/user/zhangchi.usc1992/data/rlhf')
+
+       args = parser.parse_args()
+
+       num_few_shot = 5
+       data_source = 'openai/gsm8k'
+
+       dataset = datasets.load_dataset(data_source, 'main')
+
+       train_dataset = dataset['train']
+       test_dataset = dataset['test']
+
+       # Construct a `def make_map_fn(split)` for the corresponding datasets.
+       # ...
+
+       train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
+       test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)
+
+       local_dir = args.local_dir
+       hdfs_dir = args.hdfs_dir
+
+       train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
+       test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))
+
+       makedirs(hdfs_dir)
+
+       copy(src=local_dir, dst=hdfs_dir)
+
+2. Users are required to implement the ``make_map_fn()`` function
+   (as well as ``extract_solution``) themselves to support
+   different datasets or tasks.
+
+We have already implemented data preprocessing for the GSM8k, MATH, HellaSwag and Full_hh_rlhf
+datasets. We take the GSM8k dataset as an example:
+
+**GSM8K**
+
+In ``make_map_fn``, each data item should contain the following
+five fields:
+
+1. ``data_source``: The name of the dataset. It is used to look up the
+   corresponding reward function in the ``RewardManager``.
+2. ``prompt``: This field should be constructed in the format of the
+   Hugging Face chat_template. The tokenizer in ``RLHFDataset`` will
+   apply the chat template and tokenize the prompt.
+3. ``ability``: Defines the task category.
+4. ``reward_model``: Currently, we only use the ``ground_truth``
+   field during evaluation. The ``ground_truth`` is computed by the
+   ``extract_solution`` function. **Note** that the implementation of
+   the corresponding reward function should align with this extracted
+   ``ground_truth``.
+5. ``extra_info``: Records some information about the current prompt. Not
+   used for now.
+
+.. code:: python
+
+   def extract_solution(solution_str):
+       solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str) # extract the solution after ####
+       assert solution is not None
+       final_solution = solution.group(0)
+       final_solution = final_solution.split('#### ')[1].replace(',', '')
+       return final_solution
+
+   instruction_following = "Let's think step by step and output the final answer after \"####\"."
+
+   # add a row to each data item that represents a unique id
+   def make_map_fn(split):
+
+       def process_fn(example, idx):
+           question = example.pop('question')
+
+           question = question + ' ' + instruction_following
+
+           answer = example.pop('answer')
+           solution = extract_solution(answer)
+           data = {
+               "data_source": data_source,
+               "prompt": [{
+                   "role": "user",
+                   "content": question
+               }],
+               "ability": "math",
+               "reward_model": {
+                   "style": "rule",
+                   "ground_truth": solution
+               },
+               "extra_info": {
+                   'split': split,
+                   'index': idx
+               }
+           }
+           return data
+
+       return process_fn
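+
+To check the mapping function without downloading the dataset, it can be
+applied to a single hand-written example. The sketch below restates
+``extract_solution`` and ``make_map_fn`` in a self-contained form;
+passing ``data_source`` as a parameter and the sample question are
+choices made for this illustration only.
+
```python
import re

instruction_following = "Let's think step by step and output the final answer after \"####\"."


def extract_solution(solution_str):
    """Return the number that follows '#### ', with thousands separators removed."""
    match = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert match is not None
    return match.group(1).replace(",", "")


def make_map_fn(split, data_source="openai/gsm8k"):
    def process_fn(example, idx):
        question = example.pop("question") + " " + instruction_following
        solution = extract_solution(example.pop("answer"))
        return {
            "data_source": data_source,
            "prompt": [{"role": "user", "content": question}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": solution},
            "extra_info": {"split": split, "index": idx},
        }

    return process_fn


# A hand-written sample in the GSM8k answer format.
sample = {"question": "What is 40 + 2?", "answer": "40 + 2 = <<40+2=42>>42 #### 42"}
row = make_map_fn("train")(sample, 0)

if __name__ == "__main__":
    print(row["prompt"][0]["content"])
    print(row["reward_model"]["ground_truth"])
```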
--- a/docs/preparation/reward_function.rst
+++ b/docs/preparation/reward_function.rst
+Implement Reward Function for Dataset
+=====================================
+
+For each dataset, we need to implement a reward function or use a reward model to compute the rewards for the generated responses.
+We have pre-implemented some reward functions in the `reward_score directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_.
+
+Currently, we support reward functions for the GSM8k and MATH datasets. For RLHF datasets (e.g.,
+full_hh_rlhf) and code generation (e.g., APPS), we use a reward model
+and a sandbox (to be open-sourced soon) for evaluation, respectively.
+
+RewardManager
+-------------
+
+In the entry point of the PPO post-training script, `main_ppo.py <https://github.com/volcengine/verl/blob/main/verl/trainer/main_ppo.py#L33>`_,
+we implement a ``RewardManager`` that uses the pre-implemented reward functions to compute the score for each response.
+
+The ``RewardManager`` implements a ``__call__`` function that
+computes the score for each response.
+All the reward functions are executed by ``compute_score_fn``.
+The input is a ``DataProto``, which includes:
+
+- ``input_ids``, ``attention_mask``: ``input_ids`` and ``attention_mask`` after applying
+  chat_template, including prompt and response
+- ``responses``: response tokens
+- ``ground_truth``: The ground truth string of the current prompt.
+  Stored in ``non_tensor_batch`` in the ``DataProto``, which should be
+  preprocessed in the parquet files.
+- ``data_source``: The dataset name of the current prompt. Stored in
+  ``non_tensor_batch`` in the ``DataProto``, which should be
+  preprocessed in the parquet files.
+
+After the responses are detokenized, the response string and the ground
+truth string are passed to ``compute_score_fn`` to compute the
+score for each response.
+
+Reward Functions
+----------------
+We have pre-implemented some reward functions in the `reward_score directory <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score>`_.
+
+- In the `GSM8k example <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/gsm8k.py>`_, we
+  require the response to give the final answer after ``####``, then
+  use string matching to compare it with the ground truth. A completely
+  correct answer scores 1 point; a wrong answer in the correct format
+  scores 0.1 points; an answer in an incorrect format scores 0 points.
+- In the `MATH example <https://github.com/volcengine/verl/blob/main/verl/utils/reward_score/math.py>`_, we follow
+  the implementation in `lm-evaluation-harness repository <https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/hendrycks_math/utils.py>`_.
--- a/docs/requirements-docs.txt
+++ b/docs/requirements-docs.txt
+# markdown support
+recommonmark
+# markdown table support
+sphinx-markdown-tables
+
+# theme default rtd
+
+# crate-docs-theme
+sphinx-rtd-theme
\ No newline at end of file
--- a/docs/workers/fsdp_workers.rst
+++ b/docs/workers/fsdp_workers.rst
+PyTorch FSDP Backend
+====================
+
+We support the PyTorch FSDP backend by implementing various workers for
+actor, critic, reference, rollout and reward models. We also implement
+the ``FSDPVLLMShardingManager``, which reshards weights between FSDP and
+vLLM, in `fsdp_vllm.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/hybrid_engine/fsdp_vllm.py>`_.
+
+**Pros**
+
+- Readily supports various models.
+
+  - Users only need to implement the corresponding
+    ``dtensor_weight_loader`` for weight synchronization between FSDP
+    and vLLM. With ``hf_weight_loader``, users can directly use
+    any model supported by both HF and vLLM without any code change.
+
+- Easy to organize the forward and backward computation for each model.
+
+**Cons**
+
+- Poor scalability for large-scale models (e.g. Llama 70B
+  and 405B).
+- The resharding overhead between actor and rollout could be larger than
+  with the Megatron-LM backend.
+
+Because of its simplicity, we recommend the FSDP backend for algorithm
+research and prototyping.
+
+FSDP Workers
+------------
+
+ActorRolloutRefWorker
+^^^^^^^^^^^^^^^^^^^^^
+
+Actor/Rollout HybridEngine
+''''''''''''''''''''''''''
+
+1. HybridEngine, Actor and Rollout initialization API.
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.ONE_TO_ALL)
+   def init_model(self):
+
+``ONE_TO_ALL``: when calling the ``init_model`` function from the driver
+process, each worker (on a GPU) will execute the following model
+initialization process.
+
+The initialization details of HybridEngine, Actor and Rollout are
+highlighted below:
+
+1. ``DataParallelPPOActor`` implements the basic PPO computation logic
+   when the model is built with FSDP, including computing log probs and
+   updating the model.
+2. ``vLLMRollout`` supports generation with vLLM. We modify the vLLM
+   engine so that it runs under SPMD to fit into our
+   ``WorkerGroup`` design.
+3. ``FSDPVLLMShardingManager`` is a context manager that performs the actual
+   resharding between actor and rollout.
+
+See the `source code <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/fsdp_workers.py#L42>`_ for more information.
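+
+The idea of resharding through a context manager can be illustrated with
+a toy example. This is not the ``FSDPVLLMShardingManager`` implementation:
+the classes are hypothetical stand-ins built on plain lists, and the sketch
+assumes that entering the context makes the actor's current weights
+available to the rollout engine and that leaving it releases them.
+
```python
from contextlib import contextmanager


class ToyActor:
    """Stand-in for a sharded training model: one small weight list per shard."""

    def __init__(self, num_shards):
        self.shards = [[float(i)] for i in range(num_shards)]


class ToyRollout:
    """Stand-in for an inference engine that needs the full weights."""

    def __init__(self):
        self.weights = None

    def generate(self, prompt):
        assert self.weights is not None, "weights must be loaded before generation"
        return f"{prompt} -> generated with {len(self.weights)} params"


@contextmanager
def toy_sharding_manager(actor, rollout):
    # On entry: gather the actor shards into the layout the rollout expects.
    rollout.weights = [w for shard in actor.shards for w in shard]
    try:
        yield rollout
    finally:
        # On exit: drop the gathered copy so training can reuse the memory.
        rollout.weights = None


if __name__ == "__main__":
    actor, rollout = ToyActor(num_shards=4), ToyRollout()
    with toy_sharding_manager(actor, rollout) as engine:
        print(engine.generate("hello"))
```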
+
+2. Generate sequence and recompute log prob
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
+   def generate_sequences(self, prompts: DataProto):
+
+- ``Dispatch.DP_COMPUTE_PROTO``: The data is dispatched and
+  collected along the DP dimension.
+
+- In this function, the rollout model performs auto-regressive
+  generation and the actor model recomputes the old log prob for the
+  generated response.
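+
+The two dispatch modes used so far can be illustrated with a toy,
+single-process model. The ``register`` decorator, ``Dispatch`` enum and
+worker group below are simplified stand-ins written for this example, not
+the classes in ``single_controller``; the sketch assumes that
+``ONE_TO_ALL`` runs the same call on every worker and that
+``DP_COMPUTE_PROTO`` splits the batch evenly across workers and
+concatenates the results.
+
```python
from enum import Enum


class Dispatch(Enum):
    ONE_TO_ALL = 0
    DP_COMPUTE_PROTO = 1


def register(dispatch_mode):
    """Attach the dispatch mode to the method so the group can read it."""
    def decorator(fn):
        fn.dispatch_mode = dispatch_mode
        return fn
    return decorator


class ToyWorker:
    def __init__(self, rank):
        self.rank = rank

    @register(dispatch_mode=Dispatch.ONE_TO_ALL)
    def init_model(self):
        return f"rank {self.rank} ready"

    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
    def generate_sequences(self, prompts):
        return [p + "!" for p in prompts]


class ToyWorkerGroup:
    def __init__(self, world_size):
        self.workers = [ToyWorker(rank) for rank in range(world_size)]

    def call(self, name, data=None):
        mode = getattr(ToyWorker, name).dispatch_mode
        if mode is Dispatch.ONE_TO_ALL:
            # Every worker executes the same call; results come back per rank.
            return [getattr(w, name)() for w in self.workers]
        # DP_COMPUTE_PROTO: one equal chunk per worker, outputs concatenated.
        n = len(self.workers)
        chunk = len(data) // n
        assert chunk * n == len(data), "batch must divide evenly across workers"
        outputs = []
        for i, w in enumerate(self.workers):
            outputs.extend(getattr(w, name)(data[i * chunk:(i + 1) * chunk]))
        return outputs
```
+
+With two workers, ``ToyWorkerGroup(2).call("generate_sequences", ["a", "b", "c", "d"])``
+sends ``["a", "b"]`` to rank 0 and ``["c", "d"]`` to rank 1, and returns
+the four results in their original order.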
+
+3. Update actor model
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
+   def update_actor(self, data: DataProto):
+
+- Update the actor model weights using the PPO and entropy losses.
+
+ReferenceModel
+''''''''''''''
+
+1. Reference model initialization
+
+The reference model is initialized using the same function as the actor
+model, without initializing the HybridEngine and the optimizer. The
+reference model is then also wrapped by ``DataParallelPPOActor``.
+
+2. Compute reference log prob
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
+   def compute_ref_log_prob(self, data: DataProto):
+
+- In this function, the reference model will call the compute log prob
+  function in ``DataParallelPPOActor`` to compute the reference log
+  prob.
+
+CriticWorker and RewardWorker
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+1. Model initialization
+
+Initialization is similar to that of the reference model. The CriticWorker
+performs additional initialization for the optimizer.
+
+2. Compute Values for CriticWorker
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
+   def compute_values(self, data: DataProto):
+
+3. Update Critic
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
+   def update_critic(self, data: DataProto):
+
+4. Compute Reward
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO)
+   def compute_rm_score(self, data: DataProto):
+
+
+HybridShard
+------------
+
+We do not support FSDP ``HybridShard`` yet. Supporting it may require
+constructing a 2D device mesh and testing the corresponding
+``dtensor_weight_loader`` and ``hf_weight_loader`` for each model.
\ No newline at end of file
--- a/docs/workers/megatron_workers.rst
+++ b/docs/workers/megatron_workers.rst
+Megatron-LM Backend
+===================
+
+We support the Megatron-LM backend by implementing various workers for actor,
+critic, reference, rollout and reward models. We also implement the
+``3DHybridEngine`` using Megatron-LM and vLLM in `megatron_vllm.py <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/hybrid_engine/megatron_vllm.py>`_.
+
+**Pros**
+
+- Supports 3D parallelism and sequence parallelism for the best scalability
+  and throughput.
+- The 3D HybridEngine can significantly reduce peak memory usage and the
+  weight synchronization overhead between actor and rollout.
+
+**Cons**
+
+- Users need to implement their own models for Megatron-LM.
+- Users need to implement the corresponding weight_loader to
+
+  - synchronize the model weights between actor (in Megatron) and rollout
+    (in vLLM).
+  - load weights from checkpoints into the corresponding model in Megatron-LM.
+
+Megatron Workers
+----------------
+
+MegatronWorker
+^^^^^^^^^^^^^^
+
+``MegatronWorker`` is the base class of the different Megatron worker
+classes. It provides the ``get_megatron_global_info`` and
+``get_megatron_rank_info`` functions to retrieve the 3D parallel world
+size and rank of each ``Worker`` running on a specific GPU. This
+information is used in the transfer protocol for the Megatron backend.
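
As a rough illustration of what such world-size and rank records might contain, the sketch below decomposes a global rank into TP, PP and DP ranks. The names ``ParallelSizes``, ``ParallelRank`` and ``rank_info`` are hypothetical, and the rank ordering (TP varies fastest, then DP, then PP) is an assumption made for illustration; it is not taken from the verl or Megatron-LM source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelSizes:
    """Hypothetical container for the 3D parallel world sizes."""
    tp_size: int
    pp_size: int
    dp_size: int


@dataclass(frozen=True)
class ParallelRank:
    """Hypothetical container for one worker's 3D parallel rank."""
    tp_rank: int
    pp_rank: int
    dp_rank: int


def rank_info(global_rank: int, sizes: ParallelSizes) -> ParallelRank:
    """Decompose a global rank, assuming TP varies fastest, then DP, then PP."""
    world_size = sizes.tp_size * sizes.pp_size * sizes.dp_size
    if not 0 <= global_rank < world_size:
        raise ValueError(f"global_rank must be in [0, {world_size})")
    tp_rank = global_rank % sizes.tp_size
    dp_rank = (global_rank // sizes.tp_size) % sizes.dp_size
    pp_rank = global_rank // (sizes.tp_size * sizes.dp_size)
    return ParallelRank(tp_rank=tp_rank, pp_rank=pp_rank, dp_rank=dp_rank)


if __name__ == "__main__":
    sizes = ParallelSizes(tp_size=2, pp_size=2, dp_size=2)
    for rank in range(8):
        print(rank, rank_info(rank, sizes))
```

A driver that knows these records for every worker can decide which data chunk each worker receives and which workers' outputs to keep.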
+
+The following ``Worker`` class for different models will be utilized to
+construct the ``WorkerGroup`` .
+
+We implement various APIs for each ``Worker`` class, decorated with
+``@register(dispatch_mode=)`` . These APIs can be called by the Ray
+driver process. The data is dispatched and collected according to the
+``dispatch_mode`` of each function. The supported dispatch modes
+(i.e., transfer protocols) can be found in `decorator.py <https://github.com/volcengine/verl/blob/main/single_controller/base/decorator.py>`_.
+
+ActorRolloutRefWorker
+^^^^^^^^^^^^^^^^^^^^^
+
+This class is implemented for Actor/Rollout HybridEngine or for the
+reference model to initialize their model and perform computation.
+
+Actor/Rollout HybridEngine
+''''''''''''''''''''''''''
+
+1. HybridEngine, Actor and Rollout initialization API.
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.ONE_TO_ALL)
+   def init_model(self):
+
+``ONE_TO_ALL``: when calling the ``init_model`` function from the driver
+process, each worker (on a GPU) will execute the following model
+initialization process.
+
+The initialization details of HybridEngine, Actor and Rollout are
+highlighted below:
+
+1. ``AllGatherPPModel`` holds the memory buffer for both Actor and
+   Rollout and supports weight resharding between actor and rollout.
+2. ``MegatronPPOActor`` implements the basic PPO computation logic
+   when the model is built with Megatron, including computing log probs
+   and updating the model.
+3. ``vLLMRollout`` supports generation with vLLM. We modify the vLLM
+   Engine so that it executes under SPMD to fit into our
+   ``WorkerGroup`` design.
+4. ``MegatronVLLMShardingManager`` is a context manager that performs
+   the actual resharding between actor and rollout.
+
+See `source code <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/megatron_workers.py#L63>`_ for more information.
+
+.. code:: python
+
+   # Initialize the 3D HybridEngine
+   hybrid_engine = AllGatherPPModel(model_provider=megatron_actor_model_provider)
+   # Fetch the model at current rank
+   actor_module = hybrid_engine.this_rank_models
+   ...
+
+   # build actor model
+   self.actor = MegatronPPOActor(config=self.config.actor,
+                                 model_config=self.actor_model_config,
+                                 megatron_config=megatron_config,
+                                 actor_module=self.actor_module,
+                                 actor_optimizer=self.actor_optimizer,
+                                 actor_optimizer_config=self.actor_optim_config)
+
+   # build rollout
+   # rollout initialization
+   rollout = vLLMRollout(actor_module=params,
+                         config=self.config.rollout,
+                         tokenizer=self.tokenizer,
+                         model_hf_config=self.actor_model_config,
+                         train_tp=mpu.get_tensor_model_parallel_world_size())
+   # perform weight resharding between actor and rollout
+   sharding_manager = MegatronVLLMShardingManager(module=self.hybrid_engine,
+                                                  inference_engine=rollout.inference_engine,
+                                                  model_config=self.actor_model_config,
+                                                  layer_name_mapping=layer_name_mapping)
+   ...
+
+2. Generate sequence and recompute log prob
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.MEGATRON_PP_AS_DP_PROTO)
+   def generate_sequences(self, prompts: DataProto):
+
+- ``Dispatch.MEGATRON_PP_AS_DP_PROTO``: The PP dimension of the actor
+  model is treated as an additional DP dimension, and the driver process
+  dispatches and collects data according to this reorganization. The
+  reason is that, in HybridEngine, the actor weights, which usually use
+  larger 3D parallel sizes, are gathered along the PP and TP dimensions.
+  The data should therefore be dispatched and collected through the 3D
+  parallel group of the rollout model rather than that of the actor
+  model. However, the world size and rank information can only be
+  retrieved from ``get_megatron_global_info`` and
+  ``get_megatron_rank_info``, which record the 3D information for the
+  actor model. The data resharding inside the TP dimension is handled
+  within the HybridEngine.
+
+- In this function, the rollout model performs auto-regressive
+  generation and the actor model recomputes the old log prob for the
+  generated response.
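
The idea of treating PP as DP can be sketched as follows. This is a minimal illustration under stated assumptions, not the dispatch code in ``decorator.py``: the helper names ``split_evenly`` and ``dispatch_pp_as_dp`` are hypothetical, a batch is modelled as a plain list, and the chunk index is assumed to be ``dp_rank * pp_size + pp_rank``.

```python
from typing import List, Sequence, Tuple


def split_evenly(batch: Sequence, num_chunks: int) -> List[list]:
    """Split a batch into equally sized chunks (hypothetical helper)."""
    if len(batch) % num_chunks != 0:
        raise ValueError("batch size must be divisible by the number of chunks")
    size = len(batch) // num_chunks
    return [list(batch[i * size:(i + 1) * size]) for i in range(num_chunks)]


def dispatch_pp_as_dp(batch: Sequence,
                      rank_infos: Sequence[Tuple[int, int, int]],
                      pp_size: int,
                      dp_size: int) -> List[list]:
    """Return the chunk each worker receives when PP is treated as DP.

    rank_infos[i] is the (tp_rank, pp_rank, dp_rank) of worker i in the
    actor's 3D layout. The batch is split into pp_size * dp_size chunks,
    so workers in different PP stages get different data. Workers that
    differ only in tp_rank get the same chunk.
    """
    chunks = split_evenly(batch, pp_size * dp_size)
    per_worker = []
    for _tp_rank, pp_rank, dp_rank in rank_infos:
        # Assumed chunk order: DP first, then PP.
        per_worker.append(chunks[dp_rank * pp_size + pp_rank])
    return per_worker


if __name__ == "__main__":
    # Actor layout with tp=2, pp=2, dp=1 on 4 workers.
    infos = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
    print(dispatch_pp_as_dp(list(range(4)), infos, pp_size=2, dp_size=1))
```

With an actor layout of ``pp=2, dp=1``, the batch is split in two instead of being replicated across both pipeline stages, which matches a rollout whose weights have been gathered along PP.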
+
+3. Update actor model
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
+   def update_actor(self, data: DataProto):
+
+- ``Dispatch.MEGATRON_COMPUTE_PROTO``: The user passes data partitioned
+  along the DP dimension. Each partition is dispatched to all TP/PP
+  ranks within the same DP group, and output data is collected only
+  from the ranks with tp=0 in the last PP stage.
+- Update the actor model weights using the PPO and entropy losses.
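
The dispatch and collect rule described for ``MEGATRON_COMPUTE_PROTO`` can be sketched in a few lines. This is an illustration only: ``dispatch_compute`` and ``collect_compute`` are hypothetical names, and each worker is described by an assumed ``(tp_rank, pp_rank, dp_rank)`` tuple rather than by verl's own rank records.

```python
from typing import List, Sequence, Tuple

RankInfo = Tuple[int, int, int]  # (tp_rank, pp_rank, dp_rank), assumed layout


def dispatch_compute(dp_chunks: Sequence, rank_infos: Sequence[RankInfo]) -> list:
    """Give every worker the chunk that belongs to its DP group."""
    return [dp_chunks[dp_rank] for _tp_rank, _pp_rank, dp_rank in rank_infos]


def collect_compute(outputs: Sequence,
                    rank_infos: Sequence[RankInfo],
                    pp_size: int) -> List:
    """Keep only outputs from workers with tp_rank 0 in the last PP stage."""
    return [
        out
        for out, (tp_rank, pp_rank, _dp_rank) in zip(outputs, rank_infos)
        if tp_rank == 0 and pp_rank == pp_size - 1
    ]


if __name__ == "__main__":
    # tp=2, pp=2, dp=2, with TP varying fastest, then DP, then PP (assumed).
    infos = [(tp, pp, dp) for pp in range(2) for dp in range(2) for tp in range(2)]
    print(dispatch_compute(["a", "b"], infos))
    print(collect_compute([f"r{i}" for i in range(8)], infos, pp_size=2))
```

Only one worker per DP group contributes to the collected result, so the driver receives exactly one output per data partition.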
+
+ReferenceModel
+''''''''''''''
+
+1. Reference model initialization
+
+The reference model is initialized using the same function as the actor
+model, but without initializing the HybridEngine and Optimizer. The
+reference model is then also wrapped by ``MegatronPPOActor``.
+
+2. Compute reference log prob
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
+   def compute_ref_log_prob(self, data: DataProto):
+
+- In this function, the reference model will call the compute log prob
+  function in ``MegatronPPOActor`` to compute the reference log prob.
+
+CriticWorker and RewardWorker
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+1. Model initialization
+
+Initialization is quite similar to that of the reference model. The
+CriticWorker additionally initializes the Optimizer.
+
+2. Compute Values for CriticWorker
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
+   def compute_values(self, data: DataProto):
+
+3. Update Critic
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
+   def update_critic(self, data: DataProto):
+
+4. Compute Reward
+
+.. code:: python
+
+   @register(dispatch_mode=Dispatch.MEGATRON_COMPUTE_PROTO)
+   def compute_rm_score(self, data: DataProto):
+
+Context Parallel
+----------------
+
+Context parallelism requires the developer/contributor to implement it
+both in Megatron-LM and in the models.
--- a/docs/workers/ray_trainer.rst
+++ b/docs/workers/ray_trainer.rst
--- a/examples/data_preprocess/full_hh_rlhf.py
+++ b/examples/data_preprocess/full_hh_rlhf.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+- Preprocess data and split the training set into 75% for training RM and 25% for validating RM.
+- All the training data is used to train SFT and RL.
+- Both chosen and rejected responses are used to train SFT.
+"""
+import argparse
+import os
+
+import pandas as pd
+from datasets import load_dataset
+
+from tqdm.auto import tqdm
+
+from verl.utils.hdfs_io import copy, makedirs
+
+
+def generate_sft_dataset(target_hdfs_path_dir, local_dir='~/data/full_hh_rlhf/sft'):
+    dataset = load_dataset('Dahoas/full-hh-rlhf')
+    output = {'prompt': [], 'response': []}
+    for data in tqdm(dataset['train']):
+        # add chosen
+        output['prompt'].append(data['prompt'])
+        output['response'].append(data['chosen'])
+
+        # add rejected
+        output['prompt'].append(data['prompt'])
+        output['response'].append(data['rejected'])
+
+    df = pd.DataFrame(output)
+
+    local_dir = os.path.expanduser(local_dir)
+    os.makedirs(local_dir, exist_ok=True)
+
+    local_path = os.path.join(local_dir, 'train.parquet')
+
+    df.to_parquet(path=local_path)
+
+    if target_hdfs_path_dir is not None:
+        # create the target directory, then copy the file into it
+        hdfs_path = target_hdfs_path_dir + '/' + 'train.parquet'
+        makedirs(target_hdfs_path_dir)
+
+        copy(local_path, hdfs_path)
+
+
+def generate_rm_dataset(target_hdfs_path_dir, local_dir='~/data/full_hh_rlhf/rm'):
+    train_dataset = load_dataset('Dahoas/full-hh-rlhf', split='train[:75%]')
+    test_dataset = load_dataset('Dahoas/full-hh-rlhf', split='train[-25%:]')
+
+    local_dir = os.path.expanduser(local_dir)
+    os.makedirs(local_dir, exist_ok=True)
+
+    for dataset, name in zip([train_dataset, test_dataset], ['train', 'test']):
+        output = {'prompt': [], 'chosen': [], 'rejected': []}
+        for data in tqdm(dataset):
+            # keep the prompt together with its chosen and rejected responses
+            output['prompt'].append(data['prompt'])
+            output['chosen'].append(data['chosen'])
+            output['rejected'].append(data['rejected'])
+
+        df = pd.DataFrame(output)
+
+        local_path = os.path.join(local_dir, name + '.parquet')
+
+        df.to_parquet(path=local_path)
+
+        if target_hdfs_path_dir is not None:
+            # create the target directory, then copy the file into it
+            hdfs_path = target_hdfs_path_dir + '/' + name + '.parquet'
+            makedirs(target_hdfs_path_dir)
+
+            copy(local_path, hdfs_path)
+
+
+def generate_rl_dataset(target_hdfs_path_dir, local_dir='~/data/full_hh_rlhf/rl'):
+    dataset = load_dataset('Dahoas/full-hh-rlhf')
+    train_dataset = dataset['train']
+
+    data_source = 'Dahoas/full-hh-rlhf'
+    
+    # add a field to each data item that represents a unique id
+    def make_map_fn(split):
+
+        def process_fn(example, idx):
+            prompt = example.pop('prompt')
+            response = example.pop('response')
+
+            data = {
+                "data_source": data_source,
+                "prompt": [{
+                    "role": "user",
+                    "content": prompt
+                }],
+                "ability": "alignment",
+                "reward_model": {
+                    "style": "model",
+                    "ground_truth": response # should not be used
+                },
+                "extra_info": {
+                    'split': split,
+                    'index': idx
+                }
+            }
+            return data
+
+        return process_fn
+
+    train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
+    local_dir = os.path.expanduser(local_dir)
+    # make sure the output directory exists before writing
+    os.makedirs(local_dir, exist_ok=True)
+    local_path = os.path.join(local_dir, 'train.parquet')
+    train_dataset.to_parquet(local_path)
+
+    if target_hdfs_path_dir is not None:
+        # create the target directory, then copy the file into it
+        hdfs_path = target_hdfs_path_dir + '/' + 'train.parquet'
+        makedirs(target_hdfs_path_dir)
+
+        copy(local_path, hdfs_path)
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--split', type=str, choices=['sft', 'rm', 'rl'], required=True)
+    parser.add_argument('--local_dir', type=str, default='~/data/full_hh_rlhf')
+    parser.add_argument('--hdfs_dir', type=str, required=False, default=None)
+
+    args = parser.parse_args()
+
+    if args.split == 'sft':
+        generate_sft_dataset(args.hdfs_dir, os.path.join(args.local_dir, args.split))
+    elif args.split == 'rm':
+        generate_rm_dataset(args.hdfs_dir, os.path.join(args.local_dir, args.split))
+    elif args.split == 'rl':
+        generate_rl_dataset(args.hdfs_dir, os.path.join(args.local_dir, args.split))
+    else:
+        raise NotImplementedError
--- a/examples/data_preprocess/gsm8k.py
+++ b/examples/data_preprocess/gsm8k.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Preprocess the GSM8k dataset to parquet format
+"""
+
+import re
+import os
+import datasets
+
+from verl.utils.hdfs_io import copy, makedirs
+import argparse
+
+
+def extract_solution(solution_str):
+    solution = re.search("#### (\\-?[0-9\\.\\,]+)", solution_str)
+    assert solution is not None
+    final_solution = solution.group(0)
+    final_solution = final_solution.split('#### ')[1].replace(',', '')
+    return final_solution
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--local_dir', default='~/data/gsm8k')
+    parser.add_argument('--hdfs_dir', default=None)
+
+    args = parser.parse_args()
+
+    num_few_shot = 5
+    data_source = 'openai/gsm8k'
+
+    dataset = datasets.load_dataset(data_source, 'main')
+
+    train_dataset = dataset['train']
+    test_dataset = dataset['test']
+
+    instruction_following = "Let's think step by step and output the final answer after \"####\"."
+
+    # add a field to each data item that represents a unique id
+    def make_map_fn(split):
+
+        def process_fn(example, idx):
+            question = example.pop('question')
+
+            question = question + ' ' + instruction_following
+
+            answer = example.pop('answer')
+            solution = extract_solution(answer)
+            data = {
+                "data_source": data_source,
+                "prompt": [{
+                    "role": "user",
+                    "content": question
+                }],
+                "ability": "math",
+                "reward_model": {
+                    "style": "rule",
+                    "ground_truth": solution
+                },
+                "extra_info": {
+                    'split': split,
+                    'index': idx
+                }
+            }
+            return data
+
+        return process_fn
+
+    train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
+    test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)
+
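+    # expand '~' in the default path and make sure the directory exists
+    local_dir = os.path.expanduser(args.local_dir)
+    os.makedirs(local_dir, exist_ok=True)
+    hdfs_dir = args.hdfs_dir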
+
+    train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
+    test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))
+
+    if hdfs_dir is not None:
+        makedirs(hdfs_dir)
+
+        copy(src=local_dir, dst=hdfs_dir)
--- a/examples/data_preprocess/hellaswag.py
+++ b/examples/data_preprocess/hellaswag.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Preprocess Hellaswag dataset.
+
+"""
+
+import re
+import os
+import datasets
+
+from verl.utils.hdfs_io import copy, makedirs
+import argparse
+
+
+def preprocess(text):
+    text = text.strip()
+    # NOTE: Brackets are artifacts of the WikiHow dataset portion of HellaSwag.
+    text = text.replace(" [title]", ". ")
+    text = re.sub("\\[.*?\\]", "", text)
+    text = text.replace("  ", " ")
+    return text
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--local_dir', default='/opt/tiger/hellaswag')
+    parser.add_argument('--hdfs_dir', default=None)
+
+    args = parser.parse_args()
+
+    data_source = 'Rowan/hellaswag'
+
+    dataset = datasets.load_dataset(data_source, trust_remote_code=True)
+
+    train_dataset = dataset['train']
+    val_dataset = dataset['validation']
+    test_dataset = dataset['test']
+
+    instruction = 'Please complete the following sentence.\n'
+
+    def make_map_fn(split):
+
+        def process_fn(doc, idx):
+            ctx = doc["ctx_a"] + " " + doc["ctx_b"].capitalize()
+            query = preprocess(doc["activity_label"] + ": " + ctx)
+            choices = [preprocess(ending) for ending in doc["endings"]]
+            gold = int(doc["label"])
+
+            data = {
+                "data_source": data_source,
+                "prompt": [{
+                    "role": "user",
+                    "content": query
+                }],
+                "ability": "nlp",
+                "reward_model": {
+                    "style": "model",
+                    "eval": "multiple_choice",  # using loglikelihood
+                    "ground_truth": gold,
+                    "choices": choices
+                },
+                "extra_info": {
+                    'split': split,
+                    'index': idx
+                }
+            }
+            return data
+
+        return process_fn
+
+    # filter data that doesn't have a label
+    train_dataset = train_dataset.filter(lambda x: len(x['label']) > 0)
+    val_dataset = val_dataset.filter(lambda x: len(x['label']) > 0)
+    test_dataset = test_dataset.filter(lambda x: len(x['label']) > 0)
+
+    train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
+    val_dataset = val_dataset.map(function=make_map_fn('validation'), with_indices=True)
+    test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)
+
+    local_dir = args.local_dir
+    hdfs_dir = args.hdfs_dir
+
+    train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
+    val_dataset.to_parquet(os.path.join(local_dir, 'validation.parquet'))
+    test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))
+
+    if hdfs_dir is not None:
+        makedirs(hdfs_dir)
+
+        copy(src=local_dir, dst=hdfs_dir)
--- a/examples/data_preprocess/math.py
+++ b/examples/data_preprocess/math.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Preprocess the MATH dataset to parquet format
+"""
+
+import os
+import datasets
+
+from verl.utils.hdfs_io import copy, makedirs
+import argparse
+
+from verl.utils.reward_score.math import remove_boxed, last_boxed_only_string
+
+
+def extract_solution(solution_str):
+    return remove_boxed(last_boxed_only_string(solution_str))
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--local_dir', default='~/data/math')
+    parser.add_argument('--hdfs_dir', default=None)
+
+    args = parser.parse_args()
+
+    data_source = 'lighteval/MATH'
+
+    dataset = datasets.load_dataset(data_source, trust_remote_code=True)
+
+    train_dataset = dataset['train']
+    test_dataset = dataset['test']
+
+    instruction_following = "Let's think step by step and output the final answer within \\boxed{}."
+
+    # add a field to each data item that represents a unique id
+    def make_map_fn(split):
+
+        def process_fn(example, idx):
+            question = example.pop('problem')
+
+            question = question + ' ' + instruction_following
+
+            answer = example.pop('solution')
+            solution = extract_solution(answer)
+            data = {
+                "data_source": data_source,
+                "prompt": [{
+                    "role": "user",
+                    "content": question
+                }],
+                "ability": "math",
+                "reward_model": {
+                    "style": "rule",
+                    "ground_truth": solution
+                },
+                "extra_info": {
+                    'split': split,
+                    'index': idx
+                }
+            }
+            return data
+
+        return process_fn
+
+    train_dataset = train_dataset.map(function=make_map_fn('train'), with_indices=True)
+    test_dataset = test_dataset.map(function=make_map_fn('test'), with_indices=True)
+
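+    # expand '~' in the default path and make sure the directory exists
+    local_dir = os.path.expanduser(args.local_dir)
+    os.makedirs(local_dir, exist_ok=True)
+    hdfs_dir = args.hdfs_dir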
+
+    train_dataset.to_parquet(os.path.join(local_dir, 'train.parquet'))
+    test_dataset.to_parquet(os.path.join(local_dir, 'test.parquet'))
+
+    if hdfs_dir is not None:
+        makedirs(hdfs_dir)
+
+        copy(src=local_dir, dst=hdfs_dir)
--- a/examples/generation/run_deepseek_v2_lite_math.sh
+++ b/examples/generation/run_deepseek_v2_lite_math.sh
+python3 -m verl.trainer.main_generation \
+    trainer.nnodes=1 \
+    trainer.n_gpus_per_node=8 \
+    data.path=~/data/rlhf/gsm8k/test.parquet \
+    data.prompt_key=prompt \
+    data.n_samples=1 \
+    data.output_path=~/data/rlhf/math/deepseek_v2_lite_gen_test.parquet \
+    model.path=deepseek-ai/deepseek-llm-7b-chat \
+    +model.trust_remote_code=True \
+    rollout.temperature=1.0 \
+    rollout.top_k=50 \
+    rollout.top_p=0.7 \
+    rollout.prompt_length=2048 \
+    rollout.response_length=1024 \
+    rollout.tensor_model_parallel_size=2 \
+    rollout.gpu_memory_utilization=0.8
--- a/examples/ppo_trainer/run_deepseek7b_llm.sh
+++ b/examples/ppo_trainer/run_deepseek7b_llm.sh
+set -x
+
+python3 -m verl.trainer.main_ppo \
+    data.train_files=$HOME/data/gsm8k/train.parquet \
+    data.val_files=$HOME/data/gsm8k/test.parquet \
+    data.train_batch_size=1024 \
+    data.val_batch_size=1312 \
+    data.max_prompt_length=512 \
+    data.max_response_length=512 \
+    actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+    actor_rollout_ref.actor.ppo_micro_batch_size=32 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=False \
+    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size=128 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size=128 \
+    actor_rollout_ref.ref.fsdp_config.param_offload=True \
+    critic.optim.lr=1e-5 \
+    critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
+    critic.model.enable_gradient_checkpointing=False \
+    critic.ppo_micro_batch_size=32 \
+    critic.model.fsdp_config.param_offload=False \
+    critic.model.fsdp_config.grad_offload=False \
+    critic.model.fsdp_config.optimizer_offload=False \
+    algorithm.kl_ctrl.kl_coef=0.001 \
+    trainer.critic_warmup=0 \
+    trainer.logger=['console','tracking'] \
+    trainer.project_name='verl_example_gsm8k' \
+    trainer.experiment_name='deepseek_llm_7b_function_rm' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=-1 \
+    trainer.total_epochs=15
\ No newline at end of file
--- a/examples/ppo_trainer/run_deepseek_full_hh_rlhf.sh
+++ b/examples/ppo_trainer/run_deepseek_full_hh_rlhf.sh
+set -x
+
+train_files=$HOME/data/full_hh_rlhf/rl/train.parquet
+test_files=$HOME/data/full_hh_rlhf/rl/train.parquet # no use
+
+python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
+    data.train_files="$train_files" \
+    data.val_files="$test_files" \
+    data.train_batch_size=512 \
+    data.val_batch_size=128 \
+    data.max_prompt_length=128 \
+    data.max_response_length=128 \
+    actor_rollout_ref.model.path=deepseek-ai/deepseek-llm-7b-chat \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
+    actor_rollout_ref.actor.ppo_micro_batch_size=16 \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size=16 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size=16 \
+    actor_rollout_ref.ref.param_offload=False \
+    critic.optim.lr=1e-5 \
+    critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
+    critic.model.enable_gradient_checkpointing=False \
+    critic.ppo_micro_batch_size=16 \
+    reward_model.enable=True \
+    reward_model.megatron.tensor_model_parallel_size=4 \
+    reward_model.model.path=deepseek-ai/deepseek-llm-7b-chat \
+    reward_model.micro_batch_size=16 \
+    reward_model.param_offload=False \
+    algorithm.kl_ctrl.kl_coef=0.001 \
+    trainer.critic_warmup=0 \
+    trainer.logger=['console','tracking'] \
+    trainer.project_name='verl_megatron_full_hh_rlhf_examples' \
+    trainer.experiment_name='deepseek_llm_7b_model_rm' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=-1 \
+    trainer.test_freq=5 \
+    trainer.total_epochs=100
\ No newline at end of file
--- a/examples/ppo_trainer/run_deepseek_math_gsm8k_megatron.sh
+++ b/examples/ppo_trainer/run_deepseek_math_gsm8k_megatron.sh
+set -x
+
+gsm8k_train_path=$HOME/data/gsm8k/train.parquet
+gsm8k_test_path=$HOME/data/gsm8k/test.parquet
+math_train_path=$HOME/data/math/train.parquet
+math_test_path=$HOME/data/math/test.parquet
+
+train_files="['$gsm8k_train_path', '$math_train_path']"
+test_files="['$gsm8k_test_path', '$math_test_path']"
+
+python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
+    data.train_files="$train_files" \
+    data.val_files="$test_files" \
+    data.train_batch_size=1024 \
+    data.val_batch_size=6312 \
+    data.max_prompt_length=1024 \
+    data.max_response_length=512 \
+    actor_rollout_ref.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+    actor_rollout_ref.actor.ppo_micro_batch_size=32 \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size=32 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size=32 \
+    critic.optim.lr=1e-5 \
+    critic.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
+    critic.model.enable_gradient_checkpointing=False \
+    critic.ppo_micro_batch_size=32 \
+    algorithm.kl_ctrl.kl_coef=0.001 \
+    trainer.critic_warmup=0 \
+    trainer.logger=['console','tracking'] \
+    trainer.project_name='verl_megatron_math_gsm8k_examples' \
+    trainer.experiment_name='deepseek_llm_7b_function_rm' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=-1 \
+    trainer.test_freq=5 \
+    trainer.total_epochs=100
\ No newline at end of file
--- a/examples/ppo_trainer/run_deepseek_megatron.sh
+++ b/examples/ppo_trainer/run_deepseek_megatron.sh
+set -x
+
+python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megatron_trainer' \
+    data.train_files=$HOME/data/gsm8k/train.parquet \
+    data.val_files=$HOME/data/gsm8k/test.parquet \
+    data.train_batch_size=1024 \
+    data.val_batch_size=1312 \
+    data.max_prompt_length=512 \
+    data.max_response_length=512 \
+    actor_rollout_ref.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
+    actor_rollout_ref.actor.optim.lr=2e-6 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+    actor_rollout_ref.actor.ppo_micro_batch_size=64 \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size=64 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size=128 \
+    critic.optim.lr=2e-5 \
+    critic.model.path=deepseek-ai/deepseek-coder-6.7b-instruct \
+    critic.model.enable_gradient_checkpointing=False \
+    critic.ppo_micro_batch_size=64 \
+    algorithm.kl_ctrl.kl_coef=0.001 \
+    trainer.critic_warmup=0 \
+    trainer.logger=['console','tracking'] \
+    trainer.project_name='verl_megatron_gsm8k_examples' \
+    trainer.experiment_name='deepseek_llm_7b_function_rm' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=-1 \
+    trainer.total_epochs=15
\ No newline at end of file
--- a/examples/ppo_trainer/run_qwen2-7b.sh
+++ b/examples/ppo_trainer/run_qwen2-7b.sh
+set -x
+
+gsm8k_train_path=$HOME/data/gsm8k/train.parquet
+gsm8k_test_path=$HOME/data/gsm8k/test.parquet
+math_train_path=$HOME/data/math/train.parquet
+math_test_path=$HOME/data/math/test.parquet
+
+train_files="['$gsm8k_train_path', '$math_train_path']"
+test_files="['$gsm8k_test_path', '$math_test_path']"
+
+python3 -m verl.trainer.main_ppo \
+    data.train_files="$train_files" \
+    data.val_files="$test_files" \
+    data.train_batch_size=1024 \
+    data.val_batch_size=6312 \
+    data.max_prompt_length=1024 \
+    data.max_response_length=512 \
+    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+    actor_rollout_ref.actor.ppo_micro_batch_size=16 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=False \
+    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size=16 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size=16 \
+    actor_rollout_ref.ref.fsdp_config.param_offload=True \
+    critic.optim.lr=1e-5 \
+    critic.model.path=Qwen/Qwen2-7B-Instruct \
+    critic.model.enable_gradient_checkpointing=False \
+    critic.ppo_micro_batch_size=16 \
+    critic.model.fsdp_config.param_offload=False \
+    critic.model.fsdp_config.grad_offload=False \
+    critic.model.fsdp_config.optimizer_offload=False \
+    algorithm.kl_ctrl.kl_coef=0.001 \
+    trainer.critic_warmup=0 \
+    trainer.logger=['console','tracking'] \
+    trainer.project_name='verl_example' \
+    trainer.experiment_name='Qwen2-7B-Instruct_function_rm' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=-1 \
+    trainer.test_freq=10 \
+    trainer.total_epochs=15
\ No newline at end of file
--- a/examples/ppo_trainer/run_qwen2-7b_rm.sh
+++ b/examples/ppo_trainer/run_qwen2-7b_rm.sh
+set -x
+# Disclaimer: the model used in this script is only an academic example.
+gsm8k_train_path=$HOME/data/gsm8k/train.parquet
+gsm8k_test_path=$HOME/data/gsm8k/test.parquet
+math_train_path=$HOME/data/math/train.parquet
+math_test_path=$HOME/data/math/test.parquet
+
+train_files="['$gsm8k_train_path', '$math_train_path']"
+test_files="['$gsm8k_test_path', '$math_test_path']"
+
+python3 -m verl.trainer.main_ppo \
+    data.train_files="$train_files" \
+    data.val_files="$test_files" \
+    data.train_batch_size=1024 \
+    data.val_batch_size=6312 \
+    data.max_prompt_length=1024 \
+    data.max_response_length=512 \
+    data.return_raw_chat=True \
+    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+    actor_rollout_ref.actor.ppo_micro_batch_size=16 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=False \
+    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size=16 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size=16 \
+    actor_rollout_ref.ref.fsdp_config.param_offload=True \
+    critic.optim.lr=1e-5 \
+    critic.optim.lr_warmup_steps_ratio=0.05 \
+    critic.model.path=Qwen/Qwen2-7B-Instruct \
+    critic.model.enable_gradient_checkpointing=False \
+    critic.ppo_micro_batch_size=16 \
+    critic.model.fsdp_config.param_offload=False \
+    critic.model.fsdp_config.grad_offload=False \
+    critic.model.fsdp_config.optimizer_offload=False \
+    reward_model.enable=True \
+    reward_model.model.path=sfairXC/FsfairX-Gemma2-RM-v0.1 \
+    reward_model.model.fsdp_config.param_offload=True \
+    reward_model.micro_batch_size=16 \
+    algorithm.kl_ctrl.kl_coef=0.001 \
+    trainer.critic_warmup=0 \
+    trainer.logger=['console','tracking'] \
+    trainer.project_name='verl_example' \
+    trainer.experiment_name='Qwen2-7B-Instruct_hybrid_rm' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=-1 \
+    trainer.test_freq=5 \
+    trainer.total_epochs=15
\ No newline at end of file
--- a/examples/ppo_trainer/run_qwen2.5-32b.sh
+++ b/examples/ppo_trainer/run_qwen2.5-32b.sh
+set -x
+
+gsm8k_train_path=$HOME/data/gsm8k/train.parquet
+gsm8k_test_path=$HOME/data/gsm8k/test.parquet
+math_train_path=$HOME/data/math/train.parquet
+math_test_path=$HOME/data/math/test.parquet
+
+train_files="['$gsm8k_train_path', '$math_train_path']"
+test_files="['$gsm8k_test_path', '$math_test_path']"
+
+python3 -m verl.trainer.main_ppo \
+    data.train_files="$train_files" \
+    data.val_files="$test_files" \
+    data.train_batch_size=1024 \
+    data.val_batch_size=6304 \
+    data.max_prompt_length=1024 \
+    data.max_response_length=1024 \
+    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
+    actor_rollout_ref.model.enable_gradient_checkpointing=False \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+    actor_rollout_ref.actor.ppo_micro_batch_size=16 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=False \
+    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size=128 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.3 \
+    actor_rollout_ref.ref.log_prob_micro_batch_size=128 \
+    actor_rollout_ref.ref.fsdp_config.param_offload=True \
+    critic.optim.lr=1e-5 \
+    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
+    critic.model.enable_gradient_checkpointing=False \
+    critic.ppo_micro_batch_size=32 \
+    critic.model.fsdp_config.param_offload=False \
+    critic.model.fsdp_config.grad_offload=False \
+    critic.model.fsdp_config.optimizer_offload=False \
+    algorithm.kl_ctrl.kl_coef=0.0001 \
+    trainer.critic_warmup=0 \
+    trainer.logger=['console','tracking'] \
+    trainer.project_name='verl_example' \
+    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=4 \
+    trainer.save_freq=-1 \
+    trainer.test_freq=10 \
+    trainer.total_epochs=15
\ No newline at end of file
--- a/examples/ray/tutorial.ipynb
+++ b/examples/ray/tutorial.ipynb
--- a/examples/sft/gsm8k/run_deepseek_6b7.sh
+++ b/examples/sft/gsm8k/run_deepseek_6b7.sh
+set -x
+
+hdfs_path=hdfs://user/verl/experiments/gsm8k/deepseek-coder-6.7b-instruct/ # replace with your own HDFS/local path
+
+torchrun -m verl.trainer.fsdp_sft_trainer \
+    data.train_files=$HOME/data/gsm8k/train.parquet \
+    data.val_files=$HOME/data/gsm8k/test.parquet \
+    data.prompt_key=question \
+    data.response_key=answer \
+    data.micro_batch_size=8 \
+    model.partial_pretrain=deepseek-ai/deepseek-coder-6.7b-instruct \
+    trainer.default_hdfs_dir=$hdfs_path \
+    trainer.project_name=gsm8k-sft \
+    trainer.experiment_name=gsm8k-sft-deepseek-coder-6.7b-instruct \
+    trainer.total_epochs=4 \
+    trainer.logger=['console','tracking']
\ No newline at end of file
--- a/examples/sft/gsm8k/run_gemma_2b.sh
+++ b/examples/sft/gsm8k/run_gemma_2b.sh
+# Tested on 4 GPUs
+
+set -x
+
+hdfs_path=hdfs://user/verl/experiments/gsm8k/gemma-2b-it/ # replace with your own HDFS/local path
+
+torchrun -m verl.trainer.fsdp_sft_trainer \
+    data.train_files=$HOME/data/gsm8k/train.parquet \
+    data.val_files=$HOME/data/gsm8k/test.parquet \
+    data.prompt_key=question \
+    data.response_key=answer \
+    data.micro_batch_size=32 \
+    model.partial_pretrain=google/gemma-2b-it \
+    trainer.default_hdfs_dir=$hdfs_path \
+    trainer.project_name=gsm8k-sft \
+    trainer.experiment_name=gsm8k-sft-gemma-2b-it \
+    trainer.total_epochs=3 \
+    trainer.logger=['console','tracking']
\ No newline at end of file
--- a/examples/sft/gsm8k/run_gemma_7b.sh
+++ b/examples/sft/gsm8k/run_gemma_7b.sh
+set -x
+
+hdfs_path=hdfs://user/verl/experiments/gsm8k/gemma-1.1-7b-it/ # replace with your own HDFS/local path
+
+torchrun -m verl.trainer.fsdp_sft_trainer \
+    data.train_files=$HOME/data/gsm8k/train.parquet \
+    data.val_files=$HOME/data/gsm8k/test.parquet \
+    data.prompt_key=question \
+    data.response_key=answer \
+    data.micro_batch_size=8 \
+    model.partial_pretrain=google/gemma-1.1-7b-it \
+    trainer.default_hdfs_dir=$hdfs_path \
+    trainer.project_name=gsm8k-sft \
+    trainer.experiment_name=gsm8k-sft-gemma-1.1-7b-it \
+    trainer.total_epochs=4 \
+    trainer.logger=['console','tracking']
\ No newline at end of file
--- a/patches/megatron_v4.patch
+++ b/patches/megatron_v4.patch
--- a/requirements.txt
+++ b/requirements.txt
+transformers 
+hydra-core 
+tensordict < 0.3.1
+numpy 
+pytest 
+deepspeed 
+pybind11 
+codetiming
+yapf
+wandb
+git+https://github.com/NVIDIA/TransformerEngine.git@stable
+# vllm==0.5.4 # vllm is installed in image building to avoid ray conflicts
\ No newline at end of file
--- a/setup.py
+++ b/setup.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from setuptools import setup, find_packages
+import os
+
+version_folder = os.path.dirname(os.path.abspath(__file__))
+
+with open(os.path.join(version_folder, 'verl/version/version')) as f:
+    __version__ = f.read().strip()
+
+# TODO: add version info to requirements
+install_requires = [
+    'tensordict',
+    'transformers<4.45',
+    'codetiming',
+    'pybind11',
+    'hydra-core',
+    'numpy',
+    'pytest',
+    'yapf',
+    "dill",
+    "accelerate"
+]
+
+install_optional = [
+    'vllm==0.5.4',
+    'liger-kernel'
+]
+
+extras_require = {
+    'demo': ['hydra-core', 'transformers'],
+    'single-controller': ['ray', 'kubernetes'],
+    'single-controller-ray': ['ray'],
+}
+
+setup(
+    name='verl',
+    version=__version__,
+    package_dir={'': '.'},
+    packages=find_packages(where='.'),
+    url='https://github.com/volcengine/verl',
+    license='Apache 2.0',
+    author='Bytedance - Seed - MLSys',
+    author_email='zhangchi.usc1992@bytedance.com, gmsheng@connect.hku.hk',
+    description='veRL: Volcano Engine Reinforcement Learning for LLM',
+    install_requires=install_requires,
+    extras_require=extras_require,
+    package_data={'': ['version/*']},
+    include_package_data=True,
+)
--- a/single_controller/__init__.py
+++ b/single_controller/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+version_folder = os.path.dirname(os.path.abspath(__file__))
+
+with open(os.path.join(version_folder, 'version/version')) as f:
+    __version__ = f.read().strip()
--- a/single_controller/base/__init__.py
+++ b/single_controller/base/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .worker import Worker
+from .worker_group import WorkerGroup, ClassWithInitArgs, ResourcePool
--- a/single_controller/base/decorator.py
+++ b/single_controller/base/decorator.py
--- a/single_controller/base/dp.py
+++ b/single_controller/base/dp.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from single_controller.base.worker import Worker
+
+
+class DPEngineWorker(Worker):
+
+    def __init__(self, *args, **kwargs):
+        # todo: extract _world_size etc. from kwargs and inject in super().__init__()
+        Worker.__init__(self, *args, **kwargs)
+
+    def init(self):
+        raise NotImplementedError
+
+    def add_engine(self, model, dp_config):
+        raise NotImplementedError
+
+    def execute_engine(self, method_name, *args, **kwargs):
+        print(f"execute_engine called with method={method_name}")
+        func = getattr(self._engine, method_name)
+        return func(*args, **kwargs)
+
+    def execute_module(self, method_name, *args, **kwargs):
+        print(f"execute_module called with method={method_name}")
+        func = getattr(self._engine.module, method_name)
+        return func(*args, **kwargs)
+
+    def get_model_size_on_rank_zero(self):
+        import torch
+        from verl.utils.model import get_model_size
+        if torch.distributed.get_rank() == 0:
+            # print("model print on rank 0: ", self._model)
+            module_size, module_size_scale = get_model_size(self._model)
+            return module_size, module_size_scale
+        return None
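A minimal, standalone sketch of the name-based dispatch that `execute_engine` and `execute_module` rely on. `ToyModule`, `ToyEngine`, and `ToyWorker` are hypothetical stand-ins that are not part of verl; only the `getattr` lookup mirrors the code above.

```python
class ToyModule:
    """Hypothetical stand-in for the wrapped model (`engine.module`)."""

    def num_layers(self):
        return 2


class ToyEngine:
    """Hypothetical stand-in for a data-parallel engine."""

    def __init__(self):
        self.module = ToyModule()

    def scale(self, x, factor=1):
        return x * factor


class ToyWorker:
    """Mirrors the getattr-based dispatch of DPEngineWorker, without the Worker base class."""

    def __init__(self):
        self._engine = ToyEngine()

    def execute_engine(self, method_name, *args, **kwargs):
        # Look the method up by name on the engine, then call it.
        return getattr(self._engine, method_name)(*args, **kwargs)

    def execute_module(self, method_name, *args, **kwargs):
        # Same lookup, one level deeper: on the engine's wrapped module.
        return getattr(self._engine.module, method_name)(*args, **kwargs)


worker = ToyWorker()
scaled = worker.execute_engine("scale", 3, factor=4)
layers = worker.execute_module("num_layers")
```

Because the method is resolved by string at call time, a misspelled `method_name` surfaces as an `AttributeError` on the worker rather than on the caller.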
--- a/single_controller/base/megatron/__init__.py
+++ b/single_controller/base/megatron/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/single_controller/base/megatron/worker.py
+++ b/single_controller/base/megatron/worker.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+from dataclasses import dataclass
+from single_controller.base.worker import Worker, DistRankInfo, DistGlobalInfo
+
+
+class MegatronWorker(Worker):
+
+    def __init__(self, cuda_visible_devices=None) -> None:
+        super().__init__(cuda_visible_devices)
+
+    def get_megatron_global_info(self):
+        from megatron.core import parallel_state as mpu
+        tp_size = mpu.get_tensor_model_parallel_world_size()
+        dp_size = mpu.get_data_parallel_world_size()
+        pp_size = mpu.get_pipeline_model_parallel_world_size()
+        info = DistGlobalInfo(tp_size=tp_size, dp_size=dp_size, pp_size=pp_size)
+        return info
+
+    def get_megatron_rank_info(self):
+        from megatron.core import parallel_state as mpu
+        tp_rank = mpu.get_tensor_model_parallel_rank()
+        dp_rank = mpu.get_data_parallel_rank()
+        pp_rank = mpu.get_pipeline_model_parallel_rank()
+        info = DistRankInfo(tp_rank=tp_rank, dp_rank=dp_rank, pp_rank=pp_rank)
+        return info
\ No newline at end of file
--- a/single_controller/base/megatron/worker_group.py
+++ b/single_controller/base/megatron/worker_group.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Dict
+
+from .worker import DistRankInfo, DistGlobalInfo
+from single_controller.base import ResourcePool, WorkerGroup
+
+
+class MegatronWorkerGroup(WorkerGroup):
+
+    def __init__(self, resource_pool: ResourcePool, **kwargs):
+        super().__init__(resource_pool=resource_pool, **kwargs)
+        self._megatron_rank_info = None
+        self._megatron_global_info: DistGlobalInfo = None
+
+    def init_megatron(self, default_megatron_kwargs: Dict = None):
+        raise NotImplementedError("MegatronWorkerGroup.init_megatron should be overridden")
+
+    def get_megatron_rank_info(self, rank: int) -> DistRankInfo:
+        assert 0 <= rank < self.world_size, f'rank must be in [0, world_size), got {rank}'
+        return self._megatron_rank_info[rank]
+
+    @property
+    def tp_size(self):
+        assert self._megatron_global_info is not None, "MegatronWorkerGroup._megatron_global_info must be initialized"
+        return self._megatron_global_info.tp_size
+
+    @property
+    def dp_size(self):
+        assert self._megatron_global_info is not None, "MegatronWorkerGroup._megatron_global_info must be initialized"
+        return self._megatron_global_info.dp_size
+
+    @property
+    def pp_size(self):
+        assert self._megatron_global_info is not None, "MegatronWorkerGroup._megatron_global_info must be initialized"
+        return self._megatron_global_info.pp_size
+
+    def get_megatron_global_info(self):
+        return self._megatron_global_info
--- a/single_controller/base/register_center/__init__.py
+++ b/single_controller/base/register_center/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/single_controller/base/register_center/ray.py
+++ b/single_controller/base/register_center/ray.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import ray
+
+
+@ray.remote
+class WorkerGroupRegisterCenter:
+
+    def __init__(self, rank_zero_info):
+        self.rank_zero_info = rank_zero_info
+
+    def get_rank_zero_info(self):
+        return self.rank_zero_info
+
+
+def create_worker_group_register_center(name, info):
+    return WorkerGroupRegisterCenter.options(name=name).remote(info)
--- a/single_controller/base/worker.py
+++ b/single_controller/base/worker.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+the class for Worker
+"""
+import os
+import socket
+from dataclasses import dataclass
+from single_controller.base.decorator import register, Dispatch
+
+
+@dataclass
+class DistRankInfo:
+    tp_rank: int
+    dp_rank: int
+    pp_rank: int
+
+
+@dataclass
+class DistGlobalInfo:
+    tp_size: int
+    dp_size: int
+    pp_size: int
+
+
+class WorkerHelper:
+
+    def _get_node_ip(self):
+
+        def get_node_ip_by_sdk():
+            if os.getenv("WG_BACKEND", None) == "ray":
+                import ray
+                return ray._private.services.get_node_ip_address()
+            elif os.getenv("WG_BACKEND", None) == "torch_rpc":
+                from single_controller.torchrpc.k8s_client import get_ip_addr
+                return get_ip_addr()
+            return None
+
+        host_ipv4 = os.getenv("MY_HOST_IP", None)
+        host_ipv6 = os.getenv("MY_HOST_IPV6", None)
+        host_ip_by_env = host_ipv4 or host_ipv6
+        host_ip_by_sdk = get_node_ip_by_sdk()
+
+        host_ip = host_ip_by_env or host_ip_by_sdk
+        return host_ip
+
+    def _get_free_port(self):
+        with socket.socket() as sock:
+            sock.bind(('', 0))
+            return sock.getsockname()[1]
+
+    def get_availale_master_addr_port(self):
+        return self._get_node_ip(), str(self._get_free_port())
+
+    def _get_pid(self):
+        return os.getpid()
+
+
+class WorkerMeta:
+    keys = [
+        "WORLD_SIZE", "RANK", "LOCAL_WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT", "CUDA_VISIBLE_DEVICES"
+    ]
+
+    def __init__(self, store) -> None:
+        self._store = store
+
+    def to_dict(self):
+        return {f"_{key.lower()}": self._store.get(f"_{key.lower()}", None) for key in WorkerMeta.keys}
+
+
+# we assume that in each WorkerGroup, there is a Master Worker
+class Worker(WorkerHelper):
+
+    def __new__(cls, *args, **kwargs):
+        instance = super().__new__(cls)
+
+        # note that here we use int to distinguish
+        disable_worker_init = int(os.environ.get('DISABLE_WORKER_INIT', 0))
+        if disable_worker_init:
+            return instance
+
+        rank = os.environ.get("RANK", None)
+        worker_group_prefix = os.environ.get("WG_PREFIX", None)
+
+        # __new__ is also invoked when the @ray.remote decorator is applied; in that case we do not want to run _configure_before_init
+        if None not in [rank, worker_group_prefix] and 'ActorClass(' not in cls.__name__:
+            instance._configure_before_init(f"{worker_group_prefix}_register_center", int(rank))
+
+        return instance
+
+    def _configure_before_init(self, register_center_name: str, rank: int):
+        assert isinstance(rank, int), f"rank must be int, instead of {type(rank)}"
+
+        if rank == 0:
+            master_addr, master_port = self.get_availale_master_addr_port()
+            rank_zero_info = {
+                "MASTER_ADDR": master_addr,
+                "MASTER_PORT": master_port,
+            }
+
+            if os.getenv("WG_BACKEND", None) == "ray":
+                from single_controller.base.register_center.ray import create_worker_group_register_center
+                self.register_center = create_worker_group_register_center(name=register_center_name,
+                                                                           info=rank_zero_info)
+
+            os.environ.update(rank_zero_info)
+
+    def __init__(self, cuda_visible_devices=None) -> None:
+        # construct a meta from environment variables. Note that the import must be inside the class because it is executed remotely
+        import os
+        world_size = int(os.environ['WORLD_SIZE'])
+        rank = int(os.environ['RANK'])
+        self._rank = rank
+        self._world_size = world_size
+
+        master_addr = os.environ["MASTER_ADDR"]
+        master_port = os.environ["MASTER_PORT"]
+
+        local_world_size = int(os.getenv("LOCAL_WORLD_SIZE", "1"))
+        local_rank = int(os.getenv("LOCAL_RANK", "0"))
+
+        store = {
+            '_world_size': world_size,
+            '_rank': rank,
+            '_local_world_size': local_world_size,
+            '_local_rank': local_rank,
+            '_master_addr': master_addr,
+            '_master_port': master_port
+        }
+        if cuda_visible_devices is not None:
+            store['_cuda_visible_devices'] = cuda_visible_devices
+
+        meta = WorkerMeta(store=store)
+        self._configure_with_meta(meta=meta)
+
+    def _configure_with_meta(self, meta: WorkerMeta):
+        """
+        This function should only be called internally by WorkerGroup
+        """
+        assert isinstance(meta, WorkerMeta)
+        self.__dict__.update(meta.to_dict())  # this is hacky
+        # print(f"__dict__: {self.__dict__}")
+        for key in WorkerMeta.keys:
+            val = self.__dict__.get(f"_{key.lower()}", None)
+            if val is not None:
+                # print(f"set {key} to {val}")
+                os.environ[key] = str(val)
+        os.environ["REDIS_STORE_SERVER_HOST"] = str(self._master_addr).replace("[", "").replace(
+            "]", "") if self._master_addr else ""
+
+    def get_master_addr_port(self):
+        return self._master_addr, self._master_port
+
+    def get_cuda_visible_devices(self):
+        import os
+        cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "not set")
+        return cuda_visible_devices
+
+    @property
+    def world_size(self):
+        return self._world_size
+
+    @property
+    def rank(self):
+        return self._rank
+
+    @register(dispatch_mode=Dispatch.DP_COMPUTE_PROTO_WITH_FUNC)
+    def execute_with_func_generator(self, func, *args, **kwargs):
+        ret_proto = func(self, *args, **kwargs)
+        return ret_proto
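A standalone sketch of the key mapping that `WorkerMeta.to_dict` and `Worker._configure_with_meta` perform together: upper-case environment names become underscore-prefixed attribute names, and only values that are not `None` are written back to the environment. The helper names `to_attr_dict` and `export_to_env` and the sample values are hypothetical, and a plain dict stands in for `os.environ`.

```python
KEYS = ["WORLD_SIZE", "RANK", "LOCAL_WORLD_SIZE", "LOCAL_RANK",
        "MASTER_ADDR", "MASTER_PORT", "CUDA_VISIBLE_DEVICES"]


def to_attr_dict(store):
    # Same mapping as WorkerMeta.to_dict: "WORLD_SIZE" -> "_world_size"; missing keys become None.
    return {f"_{key.lower()}": store.get(f"_{key.lower()}", None) for key in KEYS}


def export_to_env(attrs, environ):
    # Same loop as Worker._configure_with_meta: only non-None values are written, as strings.
    for key in KEYS:
        val = attrs.get(f"_{key.lower()}", None)
        if val is not None:
            environ[key] = str(val)


# Hypothetical sample values for one worker.
store = {"_world_size": 8, "_rank": 3, "_master_addr": "10.0.0.1", "_master_port": "29500"}
attrs = to_attr_dict(store)
env = {}  # a plain dict stands in for os.environ so the sketch has no side effects
export_to_env(attrs, env)
```

Keys absent from the store, such as `_cuda_visible_devices` here, end up as `None` attributes and leave the corresponding environment variable untouched.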
--- a/single_controller/base/worker_group.py
+++ b/single_controller/base/worker_group.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+the class of WorkerGroup
+"""
+import logging
+import threading
+import signal
+import time
+from typing import List, Any, Callable, Dict
+
+from single_controller.base.decorator import MAGIC_ATTR, Dispatch, get_predefined_dispatch_fn, get_predefined_execute_fn
+
+
+class ResourcePool:
+
+    def __init__(self, process_on_nodes=None, max_collocate_count: int = 10, n_gpus_per_node=8) -> None:
+        if process_on_nodes is None:
+            process_on_nodes = []
+        self._store = process_on_nodes
+        self.max_collocate_count = max_collocate_count
+        self.n_gpus_per_node = n_gpus_per_node  # reserved for future Huawei hardware with 16 GPUs per node
+
+    def add_node(self, process_count):
+        self._store.append(process_count)
+
+    @property
+    def world_size(self):
+        return sum(self._store)
+
+    def __call__(self) -> Any:
+        return self._store
+
+    @property
+    def store(self):
+        return self._store
+
+    def local_world_size_list(self) -> List[int]:
+        nested_local_world_size_list = [
+            [local_world_size for _ in range(local_world_size)] for local_world_size in self._store
+        ]
+        return [item for row in nested_local_world_size_list for item in row]
+
+    def local_rank_list(self) -> List[int]:
+        nested_local_rank_list = [[i for i in range(local_world_size)] for local_world_size in self._store]
+        return [item for row in nested_local_rank_list for item in row]
+
+
+class ClassWithInitArgs:
+    """
+    This class stores a class constructor and the args/kwargs to construct the class.
+    It is used to instantiate the remote class.
+    """
+
+    def __init__(self, cls, *args, **kwargs) -> None:
+        self.cls = cls
+        self.args = args
+        self.kwargs = kwargs
+
+    # def add_arg(self, arg):
+    #     self.args += (arg,)
+
+    # def add_kwarg(self, key, value):
+    #     self.kwargs[key] = value
+
+    def __call__(self) -> Any:
+        return self.cls(*self.args, **self.kwargs)
+
+
+def check_workers_alive(workers: List, is_alive: Callable, gap_time: float = 1) -> None:
+    import time
+    while True:
+        for worker in workers:
+            if not is_alive(worker):
+                logging.warning(f"worker {worker} is not alive, sending signal to main thread")
+                signal.raise_signal(signal.SIGABRT)
+        time.sleep(gap_time)
+
+
+class WorkerGroup:
+
+    def __init__(self, resource_pool: ResourcePool, **kwargs) -> None:
+        self._is_init_with_detached_workers = resource_pool is None
+
+        if resource_pool is not None:
+            # handle the case when WorkerGroup is attached to an existing one
+            self._procecss_dispatch_config = resource_pool()
+        else:
+            self._procecss_dispatch_config = None
+
+        self._workers = []
+        self._worker_names = []
+
+        self._master_addr = None
+        self._master_port = None
+
+        self._checker_thread: threading.Thread = None
+
+    def _is_worker_alive(self, worker):
+        raise NotImplementedError(f"WorkerGroup._is_worker_alive called, should be implemented in derived class.")
+
+    def _block_until_all_workers_alive(self) -> None:
+        while True:
+            all_state = [self._is_worker_alive(worker) for worker in self._workers]
+            if False in all_state:
+                time.sleep(1)
+            else:
+                break
+
+    def start_worker_aliveness_check(self, every_n_seconds=1) -> None:
+        # before starting checking worker aliveness, make sure all workers are already alive
+        self._block_until_all_workers_alive()
+
+        self._checker_thread = threading.Thread(target=check_workers_alive,
+                                                args=(self._workers, self._is_worker_alive, every_n_seconds))
+        self._checker_thread.start()
+
+    @property
+    def world_size(self):
+        return len(self._workers)
+
+    # execute_all_async and execute_rank_zero_async should be implemented by RayWorkerGroup and TorchRPCWorkerGroup;
+    # MegatronWorkerGroup and XperfWorkerGroup should skip them
+
+    def _bind_worker_method(self, user_defined_cls, func_generator):
+        """
+        Bind the worker method to the WorkerGroup
+        """
+
+        for method_name in dir(user_defined_cls):
+
+            try:
+                method = getattr(user_defined_cls, method_name)
+                assert callable(method), f"{method_name} in {user_defined_cls} is not callable"
+            except Exception as e:
+                # if it is a property, it will fail because Class doesn't have instance property
+                continue
+
+            if hasattr(method, MAGIC_ATTR):
+                # this method is decorated by register
+                attribute = getattr(method, MAGIC_ATTR)
+                assert isinstance(attribute, Dict), f'attribute must be a dictionary. Got {type(attribute)}'
+                assert 'dispatch_mode' in attribute, 'attribute must contain the key dispatch_mode'
+
+                dispatch_mode = attribute['dispatch_mode']
+                execute_mode = attribute['execute_mode']
+                blocking = attribute['blocking']
+
+                # get dispatch fn
+                if isinstance(dispatch_mode, Dispatch):
+                    # get default dispatch fn
+                    fn = get_predefined_dispatch_fn(dispatch_mode=dispatch_mode)
+                    dispatch_fn = fn['dispatch_fn']
+                    collect_fn = fn['collect_fn']
+                else:
+                    assert isinstance(dispatch_mode, dict)
+                    assert 'dispatch_fn' in dispatch_mode
+                    assert 'collect_fn' in dispatch_mode
+                    dispatch_fn = dispatch_mode['dispatch_fn']
+                    collect_fn = dispatch_mode['collect_fn']
+
+                # get execute_fn_name
+                execute_mode = get_predefined_execute_fn(execute_mode=execute_mode)
+                wg_execute_fn_name = execute_mode['execute_fn_name']
+
+                # get execute_fn from string
+                try:
+                    execute_fn = getattr(self, wg_execute_fn_name)
+                    assert callable(execute_fn), 'execute_fn must be callable'
+                except Exception as e:
+                    print(f'execute_fn {wg_execute_fn_name} is invalid')
+                    raise
+
+                # bind a new method to the RayWorkerGroup
+                func = func_generator(self,
+                                      method_name,
+                                      dispatch_fn=dispatch_fn,
+                                      collect_fn=collect_fn,
+                                      execute_fn=execute_fn,
+                                      blocking=blocking)
+
+                try:
+                    setattr(self, method_name, func)
+                except Exception as e:
+                    raise ValueError(f'Failed to set method {method_name}') from e
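The binding logic above can be illustrated with a minimal, Ray-free sketch. Everything here is hypothetical and only mirrors the pattern: `register`, `ToyWorker`, `ToyWorkerGroup`, and the attribute name are invented for illustration, and the dispatch/collect/execute functions of the real code are collapsed into a plain loop over local worker objects.

```python
from typing import Callable, List

MAGIC_ATTR = "_sketch_register_attrs"  # hypothetical attribute name for this sketch


def register(blocking: bool = True) -> Callable:
    """Tag a method so that a worker group knows it should be bound."""

    def decorator(func: Callable) -> Callable:
        setattr(func, MAGIC_ATTR, {"blocking": blocking})
        return func

    return decorator


class ToyWorker:

    def __init__(self, rank: int) -> None:
        self.rank = rank

    @register()
    def scale(self, x: int) -> int:
        return x * (self.rank + 1)

    def helper(self) -> str:  # not decorated, so it is never bound
        return "private"


class ToyWorkerGroup:

    def __init__(self, worker_cls: type, world_size: int) -> None:
        self._workers = [worker_cls(rank) for rank in range(world_size)]
        self._bind_worker_method(worker_cls)

    def _bind_worker_method(self, worker_cls: type) -> None:
        # Scan the class, and bind only the methods tagged by the decorator.
        for method_name in dir(worker_cls):
            method = getattr(worker_cls, method_name, None)
            if callable(method) and hasattr(method, MAGIC_ATTR):
                setattr(self, method_name, self._make_func(method_name))

    def _make_func(self, method_name: str) -> Callable:
        # One closure per method name avoids the late-binding pitfall of loops.
        def func(*args, **kwargs) -> List:
            return [getattr(w, method_name)(*args, **kwargs) for w in self._workers]

        return func


group = ToyWorkerGroup(ToyWorker, world_size=3)
results = group.scale(10)  # one result per worker
```

Generating each bound function through a factory, as `func_generator` does in the diff, keeps `method_name` fixed per function instead of sharing the loop variable.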
--- a/single_controller/ray/__init__.py
+++ b/single_controller/ray/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .base import RayResourcePool, RayClassWithInitArgs, RayWorkerGroup, create_colocated_worker_cls
+from .megatron import (MegatronRayWorkerGroup, DistRankInfo, DistGlobalInfo)
\ No newline at end of file
--- a/single_controller/ray/base.py
+++ b/single_controller/ray/base.py
--- a/single_controller/ray/decorator.py
+++ b/single_controller/ray/decorator.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import functools
+import json
+import os
+
+import ray
+
+# compatibility concern
+from single_controller.base.decorator import *
+
+
+def maybe_remote(main):
+    """Schedule main function as ray remote task if VERL_DRIVER_NUM_GPUS or VERL_DRIVER_RESOURCES specified in config.
+       - VERL_DRIVER_NUM_GPUS: number of GPUs for driver task.
+       - VERL_DRIVER_RESOURCES: custom resources for driver task, e.g. {"verl_driver": 1.0}.
+
+    When submitting a job to a Ray cluster, you can set these two environment variables in runtime.yaml.
+    ```yaml
+    working_dir: "."
+    env_vars:
+      VERL_DRIVER_NUM_GPUS: "1"
+      VERL_DRIVER_RESOURCES: '{"verl_driver": 1.0}'
+    ```
+
+    ray job submit --runtime-env=runtime.yaml -- python3 test.py
+
+    Args:
+        main (Callable): main function to be scheduled.
+    """
+
+    num_gpus = 0
+    resources = {}
+    env_num_gpus = os.getenv("VERL_DRIVER_NUM_GPUS")
+    if env_num_gpus:
+        num_gpus = int(env_num_gpus)
+    env_resources = os.getenv("VERL_DRIVER_RESOURCES")
+    if env_resources:
+        resources = json.loads(env_resources)
+    print(f"verl driver num_gpus: {num_gpus}, resources={resources}")
+    assert isinstance(resources, dict), f"resources must be dict, got {type(resources)}"
+
+    @functools.wraps(main)
+    def _main(*args, **kwargs):
+        # Run main function locally.
+        if num_gpus == 0 and len(resources) == 0:
+            return main(*args, **kwargs)
+
+        # Run main function remotely as ray task.
+        f = ray.remote(num_gpus=num_gpus, resources=resources)(main)
+        return ray.get(f.remote(*args, **kwargs))
+
+    return _main
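The decision `maybe_remote` makes between local and remote execution depends only on two environment variables, so it can be sketched without Ray. The helper names `parse_driver_resources` and `runs_remotely` below are hypothetical and not part of verl; they only restate the parsing and the "run locally when nothing is requested" rule from the function above.

```python
import json
from typing import Dict, Mapping, Tuple


def parse_driver_resources(env: Mapping[str, str]) -> Tuple[int, Dict[str, float]]:
    """Read the driver GPU count and custom resources from an env-like mapping."""
    num_gpus = int(env.get("VERL_DRIVER_NUM_GPUS") or 0)
    raw_resources = env.get("VERL_DRIVER_RESOURCES")
    resources = json.loads(raw_resources) if raw_resources else {}
    if not isinstance(resources, dict):
        raise TypeError(f"resources must be dict, got {type(resources)}")
    return num_gpus, resources


def runs_remotely(env: Mapping[str, str]) -> bool:
    """Return True when the driver would be scheduled as a remote task."""
    num_gpus, resources = parse_driver_resources(env)
    return num_gpus != 0 or len(resources) != 0


local_env: Dict[str, str] = {}
cluster_env = {
    "VERL_DRIVER_NUM_GPUS": "1",
    "VERL_DRIVER_RESOURCES": '{"verl_driver": 1.0}',
}
```

Passing `os.environ` as `env` would reproduce the lookup done in the decorator; taking a mapping as an argument just makes the rule easy to check in isolation.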
--- a/single_controller/ray/dist_data_pass_protocol.py
+++ b/single_controller/ray/dist_data_pass_protocol.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from tensordict import TensorDict
+import ray
+
+from verl import DataProto
+
+
+class DistDataProto(DataProto, ray.ObjectRef):
+    ...
+    # skip for prototype, assuming dp size kept among all Roles
--- a/single_controller/ray/dp.py
+++ b/single_controller/ray/dp.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import ray
+
+from single_controller.ray.base import RayWorkerGroup, RayResourcePool, RayClassWithInitArgs
+
+
+@ray.remote
+class RefBasicRayActor:
+    ...
+
+
+class DPEngineRayWorkerGroup(RayWorkerGroup):
+
+    class DummyModule:
+
+        def __init__(self, core, methods_names) -> None:
+            self.core = core
+
+            def func_generator(method_name):
+
+                def func(*args, **kwargs):
+                    return self.core.execute_all_async("execute_engine", method_name, *args, **kwargs)
+
+                return func
+
+            for method_name in methods_names:
+                setattr(self, method_name, func_generator(method_name))
+
+    def __init__(self, name_prefix, process_dispatch_scheme, use_gpu, engine_type, *args, **kwargs) -> None:
+        from torch import nn
+        # print(f"in DataParallelEngineWrapper, name_prefix = {name_prefix}")
+        if isinstance(process_dispatch_scheme, RayResourcePool):
+            rpdc = process_dispatch_scheme
+        else:
+            rpdc = RayResourcePool(process_on_nodes=process_dispatch_scheme,
+                                   use_gpu=use_gpu,
+                                   name_prefix=name_prefix,
+                                   max_colocate_count=1)
+        rcia = RayClassWithInitArgs(engine_type, *args, **kwargs)
+
+        self._engine_type = engine_type
+
+        super().__init__(rpdc, rcia)
+
+        nn_module_methods = [
+            method_name for method_name in dir(nn.Module)
+            if callable(getattr(nn.Module, method_name)) and not method_name.startswith("__")
+        ]
+        nn_module_methods += ["__call__"]
+
+        def func_generator(method_name):
+
+            def func(*args, **kwargs):
+                return self.execute_all_async(method_name, *args, **kwargs)
+
+            return func
+
+        print(f"{engine_type} has methods: {dir(engine_type)}")
+        for method_name in dir(engine_type):
+            try:
+                is_callable = callable(getattr(engine_type, method_name))
+            except Exception as _:
+                pass
+            else:
+                if is_callable and method_name not in dir(RefBasicRayActor):
+                    print(f"register method: {method_name}")
+                    setattr(self, method_name, func_generator(method_name))
+
+        self.module = DPEngineRayWorkerGroup.DummyModule(self, nn_module_methods)
+
+    @property
+    def engine(self):
+        return self.module
+
+    def get_model_size_on_rank_zero(self):
+        results = ray.get([worker.get_model_size_on_rank_zero.remote() for worker in self._workers])
+
+        for result in results:
+            if result is not None:
+                return result
--- a/single_controller/ray/megatron.py
+++ b/single_controller/ray/megatron.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Dict, Optional
+
+import ray
+
+from .base import RayWorkerGroup, RayResourcePool, RayClassWithInitArgs
+from single_controller.base.megatron.worker import DistRankInfo, DistGlobalInfo
+from single_controller.base.megatron.worker_group import MegatronWorkerGroup
+
+
+# NOTE(sgm): for opensource megatron-core
+class NVMegatronRayWorkerGroup(RayWorkerGroup, MegatronWorkerGroup):
+    """
+    MegatronWorkerGroup queries each worker for its Megatron rank info and stores it inside the WorkerGroup
+    so that the dispatcher can use it to dispatch data.
+    """
+
+    def __init__(self, resource_pool: RayResourcePool, ray_cls_with_init: RayClassWithInitArgs, **kwargs):
+        super().__init__(resource_pool=resource_pool, ray_cls_with_init=ray_cls_with_init, **kwargs)
+        self._megatron_rank_info: DistRankInfo = self.execute_all_sync(method_name='get_megatron_rank_info')
+        self._megatron_global_info: DistGlobalInfo = ray.get(
+            self.execute_rank_zero_async(method_name='get_megatron_global_info'))
+
+
+class MegatronRayWorkerGroup(RayWorkerGroup, MegatronWorkerGroup):
+    """
+    MegatronWorkerGroup queries each worker for its Megatron rank info and stores it inside the WorkerGroup
+    so that the dispatcher can use it to dispatch data.
+    """
+
+    def __init__(self,
+                 resource_pool: RayResourcePool,
+                 ray_cls_with_init: RayClassWithInitArgs,
+                 default_megatron_kwargs: Optional[Dict] = None,
+                 **kwargs):
+        super().__init__(resource_pool=resource_pool,
+                         ray_cls_with_init=ray_cls_with_init,
+                         default_megatron_kwargs=default_megatron_kwargs,
+                         **kwargs)
+        self.init_megatron(default_megatron_kwargs=default_megatron_kwargs)
+        self._megatron_rank_info: DistRankInfo = self.execute_all_sync(method_name='get_megatron_rank_info')
+        self._megatron_global_info: DistGlobalInfo = ray.get(
+            self.execute_rank_zero_async(method_name='get_megatron_global_info'))
+
+    def init_megatron(self, default_megatron_kwargs: Optional[Dict] = None):
+        # after super, we will call init of each worker
+        if not self._is_init_with_detached_workers:
+            # only init_megatron if the WorkerGroup is created from scratch
+            self.execute_all_sync(method_name='init_megatron', default_megatron_kwargs=default_megatron_kwargs)
--- a/single_controller/version/version
+++ b/single_controller/version/version
+0.0.2
\ No newline at end of file
--- a/verl/__init__.py
+++ b/verl/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+version_folder = os.path.dirname(os.path.abspath(__file__))
+
+with open(os.path.join(version_folder, 'version/version')) as f:
+    __version__ = f.read().strip()
+
+from .protocol import DataProto
+
+from .utils.logging_utils import set_basic_config
+import logging
+
+set_basic_config(level=logging.WARNING)
--- a/verl/models/README.md
+++ b/verl/models/README.md
+# Models
+Common model zoos such as huggingface/transformers struggle when used with PyTorch-native model parallelism. Following the design principle of vLLM, we keep a simple, parallelizable, highly optimized model implementation with packed inputs in verl.
+## Adding a New Huggingface Model
+### Step 1: Copy the model file from HF to verl
+- Add a new file under verl/models/hf
+- Copy ONLY the model file from huggingface/transformers/models to verl/models/hf
+
+### Step 2: Modify the model file to use packed inputs
+- Remove all the code related to inference (kv cache)
+- Modify the inputs to include only
+    - input_ids (total_nnz,)
+    - cu_seqlens (batch_size + 1,)
+    - max_seqlen_in_batch: int
+- Note that this requires using flash attention with a causal mask.
+
+### Step 2.5: Add tests
+- Add a test to compare this version and the huggingface version
+- Follow the existing test infrastructure and add the tests to tests/models/hf
+
+### Step 3: Add a function to apply tensor parallelism
+- Please follow
+    - https://pytorch.org/docs/stable/distributed.tensor.parallel.html
+    - https://pytorch.org/tutorials/intermediate/TP_tutorial.html
+- General comments
+    - Tensor parallelism in native PyTorch is NOT auto-parallelism. You specify, through configs, how the model parameters and the inputs/outputs are sharded. These configs are then registered as hooks that reshard the inputs/outputs before/after the model forward pass.
+
+### Step 4: Add a function to apply data parallelism
+- Please use FSDP2 APIs
+- See demo here https://github.com/pytorch/torchtitan/blob/main/torchtitan/parallelisms/parallelize_llama.py#L413
+
+### Step 5: Add a function to apply pipeline parallelism
+- Arrives in PyTorch 2.4
+- Currently available only as an alpha in the nightly build
+- Check torchtitan for more details
+
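The packed-input layout from Step 2 of the README can be sketched in plain Python. This is only an illustration under the assumption that `cu_seqlens` holds the cumulative sequence boundaries in the flattened token stream, as in variable-length flash attention interfaces; `pack_sequences` is a hypothetical helper, and real code would build tensors rather than lists.

```python
from itertools import accumulate
from typing import List, Tuple


def pack_sequences(batch: List[List[int]]) -> Tuple[List[int], List[int], int]:
    """Flatten a batch of unpadded sequences into the packed layout."""
    # All tokens concatenated back to back: length total_nnz.
    input_ids = [token for seq in batch for token in seq]
    # Cumulative boundaries: sequence i occupies input_ids[cu[i]:cu[i + 1]].
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in batch))
    # Longest sequence in the batch, needed by the attention kernel.
    max_seqlen_in_batch = max((len(seq) for seq in batch), default=0)
    return input_ids, cu_seqlens, max_seqlen_in_batch


input_ids, cu_seqlens, max_seqlen = pack_sequences([[11, 12, 13], [21], [31, 32]])
second_sequence = input_ids[cu_seqlens[1]:cu_seqlens[2]]
```

Because no padding tokens are stored, the boundaries in `cu_seqlens` are the only record of where one sequence ends and the next begins.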
--- a/verl/models/__init__.py
+++ b/verl/models/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/verl/models/llama/megatron/__init__.py
+++ b/verl/models/llama/megatron/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .modeling_llama_megatron import (
+    # original model with megatron
+    ParallelLlamaModel,
+    ParallelLlamaForCausalLM,
+    # rmpad with megatron
+    ParallelLlamaForCausalLMRmPad,
+    ParallelLlamaForValueRmPad,
+    # rmpad with megatron and pipeline parallelism
+    ParallelLlamaForCausalLMRmPadPP,
+    ParallelLlamaForValueRmPadPP)
--- a/verl/models/llama/megatron/checkpoint_utils/__init__.py
+++ b/verl/models/llama/megatron/checkpoint_utils/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/verl/models/llama/megatron/checkpoint_utils/llama_loader.py
+++ b/verl/models/llama/megatron/checkpoint_utils/llama_loader.py
--- a/verl/models/llama/megatron/checkpoint_utils/llama_saver.py
+++ b/verl/models/llama/megatron/checkpoint_utils/llama_saver.py
--- a/verl/models/llama/megatron/layers/__init__.py
+++ b/verl/models/llama/megatron/layers/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .parallel_attention import ParallelLlamaAttention
+from .parallel_decoder import ParallelLlamaDecoderLayer, ParallelLlamaDecoderLayerRmPad
+from .parallel_mlp import ParallelLlamaMLP
+from .parallel_rmsnorm import ParallelLlamaRMSNorm
--- a/verl/models/llama/megatron/layers/parallel_attention.py
+++ b/verl/models/llama/megatron/layers/parallel_attention.py
--- a/verl/models/llama/megatron/layers/parallel_decoder.py
+++ b/verl/models/llama/megatron/layers/parallel_decoder.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Optional, Tuple
+
+import torch
+from torch import nn
+from transformers import LlamaConfig
+from megatron.core import ModelParallelConfig
+
+from .parallel_attention import ParallelLlamaAttention, ParallelLlamaAttentionRmPad
+from .parallel_mlp import ParallelLlamaMLP
+from .parallel_rmsnorm import ParallelLlamaRMSNorm
+
+
+class ParallelLlamaDecoderLayer(nn.Module):
+
+    def __init__(self, config: LlamaConfig, megatron_config: ModelParallelConfig):
+        super().__init__()
+        self.hidden_size = config.hidden_size
+        self.self_attn = ParallelLlamaAttention(config=config, megatron_config=megatron_config)
+
+        self.mlp = ParallelLlamaMLP(config, megatron_config=megatron_config)
+        self.input_layernorm = ParallelLlamaRMSNorm(config, megatron_config)
+        self.post_attention_layernorm = ParallelLlamaRMSNorm(config, megatron_config)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+    ) -> torch.Tensor:
+        """
+        Args:
+            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
+                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
+            position_ids (`torch.LongTensor`, *optional*): position index of each token, passed through to
+                the attention layer.
+
+        Returns:
+            `torch.Tensor`: the layer output, with the same shape as `hidden_states`.
+        """
+
+        residual = hidden_states
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        # Note: sequence parallel is hidden inside ColumnParallelLinear
+        # reduce scatter is hidden inside RowParallelLinear
+
+        # Self Attention
+        hidden_states = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+        )
+
+        # TODO: add sequence parallel operator reduce_scatter here
+
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+
+        # TODO: add sequence parallel operator all_gather here
+
+        hidden_states = self.mlp(hidden_states)
+
+        # TODO: add sequence parallel operator reduce_scatter here
+
+        hidden_states = residual + hidden_states
+
+        outputs = hidden_states
+
+        return outputs
+
+
+class ParallelLlamaDecoderLayerRmPad(nn.Module):
+
+    def __init__(self, config: LlamaConfig, megatron_config: ModelParallelConfig):
+        super().__init__()
+        self.config = config
+        self.megatron_config = megatron_config
+        self.hidden_size = config.hidden_size
+        self.self_attn = ParallelLlamaAttentionRmPad(config=config, megatron_config=megatron_config)
+
+        self.mlp = ParallelLlamaMLP(config, megatron_config=megatron_config)
+        self.input_layernorm = ParallelLlamaRMSNorm(config, megatron_config)
+        self.post_attention_layernorm = ParallelLlamaRMSNorm(config, megatron_config)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_ids: Optional[torch.LongTensor] = None,
+        sequence_length: int = None,
+        indices: torch.Tensor = None,
+        cu_seqlens: torch.Tensor = None,
+        max_seqlen_in_batch: int = None
+    ) -> torch.Tensor:
+        residual = hidden_states  # (total_nnz // sp, 1, hidden_size)
+
+        hidden_states = self.input_layernorm(hidden_states)
+
+        # Self Attention
+        # (total_nnz // sp, 1, hidden_size) -> all-gather (total_nnz, 1, hidden_size)
+        # -> col + row -> reduce-scatter -> (total_nnz // sp, 1, hidden_size)
+        hidden_states = self.self_attn(hidden_states=hidden_states,
+                                       position_ids=position_ids,
+                                       sequence_length=sequence_length,
+                                       indices=indices,
+                                       cu_seqlens=cu_seqlens,
+                                       max_seqlen_in_batch=max_seqlen_in_batch)
+
+        hidden_states = residual + hidden_states
+
+        # Fully Connected
+        # shape changes same as attn
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = hidden_states
+
+        return outputs
--- a/verl/models/llama/megatron/layers/parallel_linear.py
+++ b/verl/models/llama/megatron/layers/parallel_linear.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+# Copyright 2023 The vLLM team. 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Adapted from https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/linear.py
+
+from typing import Optional, Tuple
+
+from megatron.core import tensor_parallel
+
+
+class QKVParallelLinear(tensor_parallel.ColumnParallelLinear):
+
+    def __init__(self,
+                 input_size,
+                 num_heads,
+                 num_key_value_heads,
+                 head_dim,
+                 *,
+                 bias=True,
+                 gather_output=True,
+                 skip_bias_add=False,
+                 **kwargs):
+        # Keep input parameters, and already restrict the head numbers
+        self.input_size = input_size
+        self.q_output_size = num_heads * head_dim
+        self.kv_output_size = num_key_value_heads * head_dim
+        self.head_dim = head_dim
+        self.gather_output = gather_output
+        self.skip_bias_add = skip_bias_add
+
+        input_size = self.input_size
+        output_size = (num_heads + 2 * num_key_value_heads) * self.head_dim
+
+        super().__init__(input_size=input_size,
+                         output_size=output_size,
+                         bias=bias,
+                         gather_output=gather_output,
+                         skip_bias_add=skip_bias_add,
+                         **kwargs)
+
+
+class MergedColumnParallelLinear(tensor_parallel.ColumnParallelLinear):
+
+    def __init__(self,
+                 input_size,
+                 gate_ouput_size,
+                 up_output_size,
+                 *,
+                 bias=True,
+                 gather_output=True,
+                 skip_bias_add=False,
+                 **kwargs):
+        # Keep the input parameters; the gate and up projections are fused along the output dimension
+        self.input_size = input_size
+        self.output_size = gate_ouput_size + up_output_size
+        self.gather_output = gather_output
+        self.skip_bias_add = skip_bias_add
+
+        super().__init__(input_size=self.input_size,
+                         output_size=self.output_size,
+                         bias=bias,
+                         gather_output=gather_output,
+                         skip_bias_add=skip_bias_add,
+                         **kwargs)
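The size arithmetic behind the two fused layers above can be checked with a short sketch. The helper names and the example head counts are hypothetical; the only assumption is the usual column-parallel convention that the output dimension is divided evenly across tensor-parallel ranks.

```python
def qkv_output_size(num_heads: int, num_key_value_heads: int, head_dim: int) -> int:
    """Total output width of a fused QKV projection (Q heads plus K and V heads)."""
    return (num_heads + 2 * num_key_value_heads) * head_dim


def merged_output_size(gate_output_size: int, up_output_size: int) -> int:
    """Total output width of a fused gate/up projection."""
    return gate_output_size + up_output_size


def per_rank_output_size(output_size: int, tp_size: int) -> int:
    """Width held by one rank when the output dimension is split across tp_size ranks."""
    if output_size % tp_size != 0:
        raise ValueError(f"{output_size} is not divisible by tp_size={tp_size}")
    return output_size // tp_size


# Example with grouped-query attention: 32 query heads, 8 key/value heads.
total_qkv = qkv_output_size(num_heads=32, num_key_value_heads=8, head_dim=128)
qkv_per_rank = per_rank_output_size(total_qkv, tp_size=4)
```

With fewer key/value heads than query heads, the fused width is smaller than `3 * num_heads * head_dim`, which is why the two head counts are passed separately.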
--- a/verl/models/llama/megatron/layers/parallel_mlp.py
+++ b/verl/models/llama/megatron/layers/parallel_mlp.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from megatron.core import parallel_state as mpu
+from megatron.core import tensor_parallel
+from megatron.core import ModelParallelConfig
+from torch import nn
+from transformers.activations import ACT2FN
+from verl.models.llama.megatron.layers.parallel_linear import MergedColumnParallelLinear
+
+from verl.utils.megatron import tensor_parallel as tp_utils
+
+
+class ParallelLlamaMLP(nn.Module):
+
+    def __init__(self, config, megatron_config: ModelParallelConfig = None) -> None:
+        super().__init__()
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        # The weight is only [hidden_size, intermediate_size // model_parallel_world_size]
+
+        column_kwargs = tp_utils.get_default_kwargs_for_column_parallel_linear()
+        row_kwargs = tp_utils.get_default_kwargs_for_row_parallel_linear()
+
+        if megatron_config is not None:
+            assert column_kwargs.get('config', False), 'must have ModelParallelConfig'
+            assert row_kwargs.get('config', False), 'must have ModelParallelConfig'
+            tp_utils.update_kwargs_with_config(row_kwargs, megatron_config)
+            tp_utils.update_kwargs_with_config(column_kwargs, megatron_config)
+
+        tp_size = mpu.get_tensor_model_parallel_world_size()
+
+        self.gate_up_proj = MergedColumnParallelLinear(
+            input_size=self.hidden_size,
+            gate_ouput_size=self.intermediate_size,
+            up_output_size=self.intermediate_size,
+            bias=False,
+            gather_output=False,
+            skip_bias_add=False,
+            **column_kwargs,
+        )
+        self.gate_size = self.intermediate_size // tp_size
+
+        self.down_proj = tensor_parallel.RowParallelLinear(input_size=self.intermediate_size,
+                                                           output_size=self.hidden_size,
+                                                           bias=False,
+                                                           input_is_parallel=True,
+                                                           skip_bias_add=False,
+                                                           **row_kwargs)
+
+        self.act_fn = ACT2FN[config.hidden_act]
+
+    def forward(self, x):
+        gate_up = self.gate_up_proj(x)[0]
+        gate, up = gate_up.split(self.gate_size, dim=-1)
+        return self.down_proj(self.act_fn(gate) * up)[0]
--- a/verl/models/llama/megatron/layers/parallel_rmsnorm.py
+++ b/verl/models/llama/megatron/layers/parallel_rmsnorm.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import numbers
+import torch
+from megatron.core import ModelParallelConfig
+from torch import nn
+from transformers import LlamaConfig
+
+from apex.normalization.fused_layer_norm import fused_rms_norm_affine
+from verl.utils.megatron import sequence_parallel as sp_utils
+
+
+class ParallelLlamaRMSNorm(nn.Module):
+
+    def __init__(self, config: LlamaConfig, megatron_config: ModelParallelConfig):
+        """
+        LlamaRMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        normalized_shape = config.hidden_size
+        if isinstance(normalized_shape, numbers.Integral):
+            normalized_shape = (normalized_shape,)
+        self.normalized_shape = torch.Size(normalized_shape)
+        self.weight = nn.Parameter(torch.ones(self.normalized_shape))
+        self.variance_epsilon = config.rms_norm_eps
+
+        if megatron_config.sequence_parallel:
+            sp_utils.mark_parameter_as_sequence_parallel(self.weight)
+
+    def forward(self, hidden_states):
+        return fused_rms_norm_affine(input=hidden_states,
+                                     weight=self.weight,
+                                     normalized_shape=self.normalized_shape,
+                                     eps=self.variance_epsilon,
+                                     memory_efficient=True)
\ No newline at end of file
--- a/verl/models/llama/megatron/modeling_llama_megatron.py
+++ b/verl/models/llama/megatron/modeling_llama_megatron.py
--- a/verl/models/registry.py
+++ b/verl/models/registry.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import importlib
+from typing import List, Optional, Type
+
+import torch.nn as nn
+
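+# Architecture -> (module, class names). load_model_cls picks index 0 for actor/ref
+# and index 1 for critic/rm; the third entry is not selected by it.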
+_MODELS = {
+    "LlamaForCausalLM":
+        ("llama", ("ParallelLlamaForCausalLMRmPadPP", "ParallelLlamaForValueRmPadPP", "ParallelLlamaForCausalLMRmPad")),
+    "MistralForCausalLM": ("mistral", ("ParallelMistralForCausalLMRmPadPP", "ParallelMistralForValueRmPadPP",
+                                       "ParallelMistralForCausalLMRmPad"))
+}
+
+
+# return model class
+class ModelRegistry:
+
+    @staticmethod
+    def load_model_cls(model_arch: str, value=False) -> Optional[Type[nn.Module]]:
+        if model_arch not in _MODELS:
+            return None
+
+        megatron = "megatron"
+
+        module_name, model_cls_name = _MODELS[model_arch]
+        if not value:  # actor/ref
+            model_cls_name = model_cls_name[0]
+        else:  # critic/rm
+            model_cls_name = model_cls_name[1]
+
+        module = importlib.import_module(f"verl.models.{module_name}.{megatron}.modeling_{module_name}_megatron")
+        return getattr(module, model_cls_name, None)
+
+    @staticmethod
+    def get_supported_archs() -> List[str]:
+        return list(_MODELS.keys())
--- a/verl/models/weight_loader_registry.py
+++ b/verl/models/weight_loader_registry.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+def get_weight_loader(arch: str):
+    from verl.models.llama.megatron.checkpoint_utils.llama_loader import load_state_dict_to_megatron_llama
+    _MODEL_WEIGHT_MEGATRON_LOADER_REGISTRY = {'LlamaForCausalLM': load_state_dict_to_megatron_llama}
+
+    if arch in _MODEL_WEIGHT_MEGATRON_LOADER_REGISTRY:
+        return _MODEL_WEIGHT_MEGATRON_LOADER_REGISTRY[arch]
+    raise ValueError(f"Model architecture {arch} is not supported for now. "
+                     f"Supported architectures: {list(_MODEL_WEIGHT_MEGATRON_LOADER_REGISTRY.keys())}")
--- a/verl/protocol.py
+++ b/verl/protocol.py
--- a/verl/third_party/__init__.py
+++ b/verl/third_party/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/verl/third_party/vllm/__init__.py
+++ b/verl/third_party/vllm/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from importlib.metadata import version, PackageNotFoundError
+
+
+def get_version(pkg):
+    try:
+        return version(pkg)
+    except PackageNotFoundError:
+        return None
+
+
+package_name = 'vllm'
+package_version = get_version(package_name)
+
+if package_version == '0.3.1':
+    vllm_version = '0.3.1'
+    from .vllm_v_0_3_1.llm import LLM
+    from .vllm_v_0_3_1.llm import LLMEngine
+    from .vllm_v_0_3_1 import parallel_state
+elif package_version == '0.4.2':
+    vllm_version = '0.4.2'
+    from .vllm_v_0_4_2.llm import LLM
+    from .vllm_v_0_4_2.llm import LLMEngine
+    from .vllm_v_0_4_2 import parallel_state
+elif package_version == '0.5.4':
+    vllm_version = '0.5.4'
+    from .vllm_v_0_5_4.llm import LLM
+    from .vllm_v_0_5_4.llm import LLMEngine
+    from .vllm_v_0_5_4 import parallel_state
+else:
+    raise ValueError(
+        f'vllm version {package_version} not supported. Currently supported versions are 0.3.1, 0.4.2, and 0.5.4.')
--- a/verl/third_party/vllm/vllm_v_0_3_1/__init__.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/verl/third_party/vllm/vllm_v_0_3_1/arg_utils.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/arg_utils.py
--- a/verl/third_party/vllm/vllm_v_0_3_1/config.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/config.py
--- a/verl/third_party/vllm/vllm_v_0_3_1/llm.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/llm.py
--- a/verl/third_party/vllm/vllm_v_0_3_1/llm_engine_sp.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/llm_engine_sp.py
--- a/verl/third_party/vllm/vllm_v_0_3_1/model_loader.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/model_loader.py
--- a/verl/third_party/vllm/vllm_v_0_3_1/model_runner.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/model_runner.py
--- a/verl/third_party/vllm/vllm_v_0_3_1/parallel_state.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/parallel_state.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+# Copyright 2023 The vLLM team.
+# Adapted from
+# https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/parallel_state.py
+# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
+"""Model and data parallel groups."""
+
+import torch
+import torch.distributed
+
+import vllm.model_executor.parallel_utils.parallel_state as ps
+"""
+This version is tightly coupled to Megatron in order to implement HybridEngine and weight sharing between vllm and Megatron.
+- We assume the Megatron tp+dp+pp world is already established before initialize_model_parallel_from_megatron is called.
+
+"""
+
+# Tensor model parallel group that the current rank belongs to.
+_TENSOR_MODEL_PARALLEL_GROUP = None
+
+# Micro data parallel group. This is an additional dp group that originates from splitting the training tp
+# into infer_tp and micro_dp. By default, we use the order micro_dp - tp.
+_MICRO_DATA_PARALLEL_GROUP = None
+
+
+def initialize_model_parallel_from_megatron(
+        tensor_model_parallel_size=None  # None keeps backward compatibility: infer_tp defaults to train_tp
+) -> None:
+    from megatron.core import parallel_state as mpu
+    from megatron.distributed import new_group
+    # Get world size and rank. Ensure some consistencies.
+    assert torch.distributed.is_initialized()
+
+    if tensor_model_parallel_size is None:
+        tensor_model_parallel_size = mpu.get_tensor_model_parallel_world_size()
+    else:
+        assert isinstance(tensor_model_parallel_size, int)
+
+    # Build the tensor model-parallel groups.
+    assert ps._TENSOR_MODEL_PARALLEL_GROUP is None, ("tensor model parallel group is already initialized")
+
+    assert tensor_model_parallel_size <= mpu.get_tensor_model_parallel_world_size(
+    ), 'Not implemented for infer_tp > train_tp'
+
+    global _TENSOR_MODEL_PARALLEL_GROUP
+    global _MICRO_DATA_PARALLEL_GROUP
+
+    assert mpu.get_tensor_model_parallel_world_size() % tensor_model_parallel_size == 0
+
+    micro_dp_size = mpu.get_tensor_model_parallel_world_size() // tensor_model_parallel_size
+
+    world_size: int = torch.distributed.get_world_size()
+
+    num_micro_dp_groups = world_size // micro_dp_size
+
+    rank = torch.distributed.get_rank()
+
+    # Build the micro dp groups.
+    assert _MICRO_DATA_PARALLEL_GROUP is None, ("micro data parallel group is already initialized")
+    for i in range(num_micro_dp_groups):
+        ranks = range(i * micro_dp_size, (i + 1) * micro_dp_size)
+        group = new_group(rank=rank, ranks=ranks, group_type='micro_dp')
+        if rank in ranks:
+            _MICRO_DATA_PARALLEL_GROUP = group
+
+    if tensor_model_parallel_size == mpu.get_tensor_model_parallel_world_size():
+        # using the same tp group as Megatron
+        ps._TENSOR_MODEL_PARALLEL_GROUP = mpu.get_tensor_model_parallel_group()
+
+        _TENSOR_MODEL_PARALLEL_GROUP = mpu.get_tensor_model_parallel_group()
+        # the micro dp groups built above each contain a single rank in this case
+    else:
+        # initialize a micro_dp group and a tp group
+        # assume training tp=4, infer tp=2, then, weight is partitioned as
+        # [1], [2], [3], [4] for training and [1,2], [1,2], [3,4], [3,4] for inference
+
+        # Build the inference tp groups
+        train_tp = mpu.get_tensor_model_parallel_world_size()
+        num_tensor_model_parallel_groups_per_train_tp = train_tp // tensor_model_parallel_size
+        num_tensor_model_parallel_groups = world_size // tensor_model_parallel_size
+        assert _TENSOR_MODEL_PARALLEL_GROUP is None, ("tensor model parallel group is already initialized")
+        for i in range(num_tensor_model_parallel_groups // num_tensor_model_parallel_groups_per_train_tp):
+            start = train_tp * i
+            end = train_tp * (i + 1)
+            for j in range(num_tensor_model_parallel_groups_per_train_tp):
+                # Strided ranks within this training tp group, shifted by j.
+                ranks = [r + j for r in range(start, end, num_tensor_model_parallel_groups_per_train_tp)]
+                # group = torch.distributed.new_group(ranks)
+                group = new_group(rank=rank, ranks=ranks, group_type='infer_tp')
+                if rank in ranks:
+                    _TENSOR_MODEL_PARALLEL_GROUP = group
+                    ps._TENSOR_MODEL_PARALLEL_GROUP = _TENSOR_MODEL_PARALLEL_GROUP
+    # Build the pipeline model-parallel groups.
+    # global _PIPELINE_MODEL_PARALLEL_GROUP
+    # global _PIPELINE_GLOBAL_RANKS
+    # assert ps._PIPELINE_MODEL_PARALLEL_GROUP is None, ("pipeline model parallel group is already initialized")
+
+    # ps._PIPELINE_MODEL_PARALLEL_GROUP = mpu.get_pipeline_model_parallel_group()
+    # ps._PIPELINE_GLOBAL_RANKS = mpu.get_pipeline_model_parallel_ranks()
+
+
+"""
+Tensor model parallel utilities
+"""
+
+
+def get_tensor_model_parallel_group():
+    """Get the tensor model parallel group the caller rank belongs to."""
+    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, ("tensor model parallel group is not initialized")
+    return _TENSOR_MODEL_PARALLEL_GROUP
+
+
+def get_tensor_model_parallel_world_size():
+    """Return world size for the tensor model parallel group."""
+    return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
+
+
+def get_tensor_model_parallel_rank():
+    """Return my rank for the tensor model parallel group."""
+    return torch.distributed.get_rank(group=get_tensor_model_parallel_group())
+
+
+def get_tensor_model_parallel_src_rank():
+    """Calculate the global rank corresponding to the first local rank
+    in the tensor model parallel group."""
+    global_rank = torch.distributed.get_rank()
+    local_world_size = get_tensor_model_parallel_world_size()
+    return (global_rank // local_world_size) * local_world_size
+
+
+"""
+Micro Data parallel group
+"""
+
+
+def get_micro_data_parallel_group():
+    assert _MICRO_DATA_PARALLEL_GROUP is not None
+    return _MICRO_DATA_PARALLEL_GROUP
+
+
+def get_micro_data_parallel_world_size():
+    return torch.distributed.get_world_size(group=get_micro_data_parallel_group())
+
+
+def get_micro_data_parallel_rank():
+    return torch.distributed.get_rank(group=get_micro_data_parallel_group())
--- a/verl/third_party/vllm/vllm_v_0_3_1/tokenizer.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/tokenizer.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+# Copyright 2023 The vLLM team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Adapted from https://github.com/vllm-project/vllm/blob/main/vllm/transformers_utils/tokenizer_group/tokenizer_group.py
+
+from typing import List, Optional, Tuple, Union
+
+from transformers import (AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast)
+
+from vllm.lora.request import LoRARequest
+from vllm.utils import make_async, LRUCache
+from vllm.transformers_utils.tokenizers import *
+
+
+class TokenizerGroup:
+    """A group of tokenizers that can be used for LoRA adapters."""
+
+    def __init__(self, tokenizer: PreTrainedTokenizer, enable_lora: bool, max_num_seqs: int,
+                 max_input_length: Optional[int]):
+        self.enable_lora = enable_lora
+        self.max_input_length = max_input_length
+        self.tokenizer = tokenizer
+        if enable_lora:
+            self.lora_tokenizers = LRUCache(capacity=max_num_seqs)
+        else:
+            self.lora_tokenizers = None
+
+    def encode(self,
+               prompt: str,
+               request_id: Optional[str] = None,
+               lora_request: Optional[LoRARequest] = None) -> List[int]:
+        tokenizer = self.get_lora_tokenizer(lora_request)
+        return tokenizer.encode(prompt)
+
+    async def encode_async(self,
+                           prompt: str,
+                           request_id: Optional[str] = None,
+                           lora_request: Optional[LoRARequest] = None) -> List[int]:
+        # get_lora_tokenizer is synchronous and this class defines no async variant.
+        tokenizer = self.get_lora_tokenizer(lora_request)
+        return tokenizer.encode(prompt)
+
+    def get_lora_tokenizer(self, lora_request: Optional[LoRARequest]) -> "PreTrainedTokenizer":
+        if not lora_request or not self.enable_lora:
+            return self.tokenizer
+        if lora_request.lora_int_id not in self.lora_tokenizers:
+            # TODO(sgm): the lora tokenizer is also passed, but may be different
+            tokenizer = self.tokenizer
+            # tokenizer = (get_lora_tokenizer(
+            #     lora_request, **self.tokenizer_config) or self.tokenizer)
+            self.lora_tokenizers.put(lora_request.lora_int_id, tokenizer)
+            return tokenizer
+        else:
+            return self.lora_tokenizers.get(lora_request.lora_int_id)
+
+    # FIXME(sgm): for simplicity, we assign the special token here
+    @property
+    def pad_token_id(self):
+        return self.tokenizer.pad_token_id
+
+    @property
+    def eos_token_id(self):
+        return self.tokenizer.eos_token_id
--- a/verl/third_party/vllm/vllm_v_0_3_1/weight_loaders.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/weight_loaders.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+# Copyright 2023 The vLLM team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Adapted from https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models
+
+from typing import Dict
+import torch
+import torch.nn as nn
+
+
+# NOTE(shengguangming): replace the original weight loader function in the class
+def parallel_weight_loader(self, param: torch.Tensor, loaded_weight: torch.Tensor) -> None:
+    """Parallel Linear weight loader."""
+    assert param.size() == loaded_weight.size(
+    ), 'the parameter size does not match the loaded weight size, param size: {}, loaded_weight size: {}'.format(
+        param.size(), loaded_weight.size())
+    assert param.data.dtype == loaded_weight.data.dtype, "to share weights, the data types must also match"
+
+    param.data = loaded_weight.data
+
+
+def default_weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor) -> None:
+    """Default weight loader."""
+    assert param.size() == loaded_weight.size()
+    assert param.data.dtype == loaded_weight.data.dtype, "to share weights, the data types must also match"
+
+    param.data = loaded_weight.data
+
+
+def gpt2_weight_loader(actor_weights: Dict, vllm_model: nn.Module) -> None:
+    params_dict = dict(vllm_model.named_parameters(remove_duplicate=False))
+    for name, loaded_weight in actor_weights.items():
+        if "lm_head.weight" in name:
+            # GPT-2 ties the weights of the embedding layer and the final
+            # linear layer.
+            continue
+        if ".attn.bias" in name or ".attn.masked_bias" in name:
+            # Skip attention mask.
+            # NOTE: "c_attn.bias" should not be skipped.
+            continue
+        if not name.startswith("transformer."):
+            name = "transformer." + name
+        param = params_dict[name]
+        # The HF's GPT-2 implementation uses Conv1D instead of Linear.
+        # Because of this, we need to transpose the weights.
+        # Note(zhuohan): the logic below might break quantized models.
+        for conv1d_weight_name in ["c_attn", "c_proj", "c_fc"]:
+            if conv1d_weight_name not in name:
+                continue
+            if not name.endswith(".weight"):
+                continue
+            # TODO: check megatron
+            loaded_weight = loaded_weight.t()
+        weight_loader = getattr(param, "weight_loader", default_weight_loader)
+        weight_loader(param, loaded_weight)
+
+
+def llama_weight_loader(actor_weights: Dict, vllm_model: nn.Module) -> None:
+    # NOTE(shengguangming): the megatron llama may have this prefix
+    prefix = '0.module.module.'
+    params_dict = dict(vllm_model.named_parameters())
+    for name, loaded_weight in actor_weights.items():
+        if name.startswith(prefix):
+            name = name[len(prefix):]
+        if "rotary_emb.inv_freq" in name:
+            continue
+        else:
+            param = params_dict[name]
+            weight_loader = getattr(param, "weight_loader", default_weight_loader)
+            weight_loader(param, loaded_weight)
+
+
+def mistral_weight_loader(actor_weights: Dict, vllm_model: nn.Module) -> None:
+    # TODO: need to implement a general way to deal with prefix
+    prefix = '0.module.module.'
+    params_dict = dict(vllm_model.named_parameters())
+    for name, loaded_weight in actor_weights.items():
+        if name.startswith(prefix):
+            name = name[len(prefix):]
+        if "rotary_emb.inv_freq" in name:
+            continue
+        else:
+            param = params_dict[name]
+            weight_loader = getattr(param, "weight_loader", default_weight_loader)
+            weight_loader(param, loaded_weight)
--- a/verl/third_party/vllm/vllm_v_0_3_1/worker.py
+++ b/verl/third_party/vllm/vllm_v_0_3_1/worker.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/__init__.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/__init__.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/verl/third_party/vllm/vllm_v_0_4_2/arg_utils.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/arg_utils.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/config.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/config.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+# Copyright 2023 The vLLM team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Adapted from https://github.com/vllm-project/vllm/blob/main/vllm/config.py
+
+
+import enum
+import json
+from typing import List, Optional, Union
+from dataclasses import dataclass, field, fields
+
+from transformers import PretrainedConfig
+
+from vllm.logger import init_logger
+from vllm.model_executor.layers.quantization import get_quantization_config
+from vllm.transformers_utils.config import get_hf_text_config
+from vllm.utils import is_hip
+# Add for verl
+from vllm.config import ModelConfig, _get_and_verify_dtype, _get_and_verify_max_len
+
+GPTQMarlinConfig = get_quantization_config("gptq_marlin")
+
+logger = init_logger(__name__)
+
+_GB = 1 << 30
+
+
+class ModelConfig(ModelConfig):
+    """Configuration for the model.
+
+    Args:
+        hf_config: The huggingface PretrainedConfig of the model. Both `model`
+            and `tokenizer` are set from its `_name_or_path`.
+        tokenizer_mode: Tokenizer mode. "auto" will use the fast tokenizer if
+            available, and "slow" will always use the slow tokenizer.
+        trust_remote_code: Trust remote code (e.g., from HuggingFace) when
+            downloading the model and tokenizer.
+        download_dir: Directory to download and load the weights, default to the
+            default cache directory of huggingface.
+        load_format: The format of the model weights to load:
+            "auto" will try to load the weights in the safetensors format and
+                fall back to the pytorch bin format if safetensors format is
+                not available.
+            "pt" will load the weights in the pytorch bin format.
+            "safetensors" will load the weights in the safetensors format.
+            "npcache" will load the weights in pytorch format and store
+                a numpy cache to speed up the loading.
+            "dummy" will initialize the weights with random values, which is
+                mainly for profiling.
+        dtype: Data type for model weights and activations. The "auto" option
+            will use FP16 precision for FP32 and FP16 models, and BF16 precision
+            for BF16 models.
+        seed: Random seed for reproducibility.
+        revision: The specific model version to use. It can be a branch name,
+            a tag name, or a commit id. If unspecified, will use the default
+            version.
+        code_revision: The specific revision to use for the model code on
+            Hugging Face Hub. It can be a branch name, a tag name, or a
+            commit id. If unspecified, will use the default version.
+        tokenizer_revision: The specific tokenizer version to use. It can be a
+            branch name, a tag name, or a commit id. If unspecified, will use
+            the default version.
+        max_model_len: Maximum length of a sequence (including prompt and
+            output). If None, will be derived from the model.
+        quantization: Quantization method that was used to quantize the model
+            weights. If None, we assume the model weights are not quantized.
+        quantization_param_path: Path to JSON file containing scaling factors.
+            Used to load KV cache scaling factors into the model when KV cache
+            type is FP8_E4M3 on ROCm (AMD GPU). In the future these will also
+            be used to load activation and weight scaling factors when the
+            model dtype is FP8_E4M3 on ROCm.
+        enforce_eager: Whether to enforce eager execution. If True, we will
+            disable CUDA graph and always execute the model in eager mode.
+            If False, we will use CUDA graph and eager execution in hybrid.
+        max_context_len_to_capture: Maximum context len covered by CUDA graphs.
+            When a sequence has context length larger than this, we fall back
+            to eager mode (DEPRECATED. Use max_seq_len_to_capture instead).
+        max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
+            When a sequence has context length larger than this, we fall back
+            to eager mode.
+        skip_tokenizer_init: If true, skip initialization of tokenizer and
+            detokenizer.
+        served_model_name: The model name used in metrics tag `model_name`,
+            matches the model name exposed via the APIs. If multiple model 
+            names provided, the first name will be used. If not specified, 
+            the model name will be the same as `model`.
+    """
+
+    def __init__(
+        self,
+        hf_config: PretrainedConfig,
+        dtype: str,
+        seed: int,
+        revision: Optional[str] = None,
+        code_revision: Optional[str] = None,
+        tokenizer_revision: Optional[str] = None,
+        max_model_len: Optional[int] = None,
+        quantization: Optional[str] = None,
+        quantization_param_path: Optional[str] = None,
+        enforce_eager: bool = False,
+        max_context_len_to_capture: Optional[int] = None,
+        max_seq_len_to_capture: Optional[int] = None,
+        max_logprobs: int = 5,
+        skip_tokenizer_init: bool = False,
+        served_model_name: Optional[Union[str, List[str]]] = None,
+    ) -> None:
+        self.model = hf_config._name_or_path
+        self.tokenizer = hf_config._name_or_path
+        self.seed = seed
+        self.revision = revision
+        self.code_revision = code_revision
+        self.tokenizer_revision = tokenizer_revision
+        self.quantization = quantization
+        self.quantization_param_path = quantization_param_path
+        self.enforce_eager = enforce_eager
+        self.max_context_len_to_capture = max_context_len_to_capture
+        if self.max_context_len_to_capture is not None:
+            raise ValueError("`max_context_len_to_capture` is deprecated. "
+                             "Use `max_seq_len_to_capture` instead.")
+        self.max_seq_len_to_capture = (max_seq_len_to_capture or max_context_len_to_capture)
+        self.max_logprobs = max_logprobs
+        self.skip_tokenizer_init = skip_tokenizer_init
+
+        # self.hf_config = get_config(model, trust_remote_code, revision)
+        self.hf_config = hf_config
+        self.hf_text_config = get_hf_text_config(hf_config)
+        # TODO: for multimodal model
+        self.dtype = _get_and_verify_dtype(self.hf_config, dtype)
+        self.max_model_len = _get_and_verify_max_len(self.hf_config, max_model_len)
+        # self.served_model_name = get_served_model_name(model,
+        #                                                served_model_name)
+        # self._verify_load_format()
+        # self._verify_tokenizer_mode()
+        self._verify_quantization()
+        self._verify_cuda_graph()
+
+
+class LoadFormat(str, enum.Enum):
+    AUTO = 'auto'
+    MEGATRON = "megatron"
+    HF = "hf"
+    DTENSOR = 'dtensor'
+    DUMMY_HF = 'dummy_hf'
+    DUMMY_MEGATRON = 'dummy_megatron'
+    DUMMY_DTENSOR = 'dummy_dtensor'
+
+
+@dataclass
+class LoadConfig:
+    """
+        download_dir: Directory to download and load the weights, default to the
+            default cache directory of huggingface.
+        load_format: The format of the model weights to load. A string value
+            is lowercased and converted to a LoadFormat member, so it must be
+            one of "auto", "megatron", "hf", "dtensor", "dummy_hf",
+            "dummy_megatron", or "dummy_dtensor".
+    """
+
+    load_format: Union[str, LoadFormat, "BaseModelLoader"] = LoadFormat.AUTO
+    download_dir: Optional[str] = None
+    model_loader_extra_config: Optional[Union[str, dict]] = field(default_factory=dict)
+
+    def __post_init__(self):
+        model_loader_extra_config = self.model_loader_extra_config or {}
+        if isinstance(model_loader_extra_config, str):
+            self.model_loader_extra_config = json.loads(model_loader_extra_config)
+        self._verify_load_format()
+
+    def _verify_load_format(self) -> None:
+        if not isinstance(self.load_format, str):
+            return
+
+        load_format = self.load_format.lower()
+        self.load_format = LoadFormat(load_format)
+
+        rocm_not_supported_load_format: List[str] = []
+        if is_hip() and load_format in rocm_not_supported_load_format:
+            rocm_supported_load_format = [
+                f.value for f in LoadFormat if (f.value not in rocm_not_supported_load_format)
+            ]
+            raise ValueError(f"load format '{load_format}' is not supported in ROCm. "
+                             f"Supported load formats are "
+                             f"{rocm_supported_load_format}")
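The validation in `_verify_load_format` relies on looking up a string-valued enum by value. The following is a minimal standalone sketch of that step, not the code from this commit: `normalize_load_format` is a hypothetical helper introduced only for illustration, and the enum simply repeats the members of `LoadFormat` shown above.

```python
import enum
from typing import Union


class LoadFormat(str, enum.Enum):
    # Same members as the LoadFormat enum in the diff above.
    AUTO = "auto"
    MEGATRON = "megatron"
    HF = "hf"
    DTENSOR = "dtensor"
    DUMMY_HF = "dummy_hf"
    DUMMY_MEGATRON = "dummy_megatron"
    DUMMY_DTENSOR = "dummy_dtensor"


def normalize_load_format(value: Union[str, LoadFormat]) -> LoadFormat:
    """Hypothetical helper: lower-case a string and look it up by value."""
    if isinstance(value, LoadFormat):
        return value
    # Enum lookup by value raises ValueError for unknown strings.
    return LoadFormat(value.lower())


print(normalize_load_format("HF"))
try:
    normalize_load_format("safetensors")
except ValueError as exc:
    print(f"rejected: {exc}")
```

Because the lookup is by value, any string outside the seven members fails fast at configuration time rather than later during weight loading.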
--- a/verl/third_party/vllm/vllm_v_0_4_2/dtensor_weight_loaders.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/dtensor_weight_loaders.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/hf_weight_loader.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/hf_weight_loader.py
+# Copyright 2024 Bytedance Ltd. and/or its affiliates
+# Copyright 2023 The vLLM team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Adapted from https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models
+
+from typing import Dict, Iterable, Tuple
+
+import torch
+import torch.nn as nn
+
+from vllm.model_executor.model_loader.utils import set_default_torch_dtype
+from vllm.model_executor.model_loader.weight_utils import default_weight_loader
+
+
+def update_hf_weight_loader():
+    from vllm.model_executor.models.gemma import GemmaForCausalLM
+    GemmaForCausalLM.load_weights = gemma_load_weights
+
+
+def gemma_load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
+    stacked_params_mapping = [
+        # (param_name, shard_name, shard_id)
+        ("qkv_proj", "q_proj", "q"),
+        ("qkv_proj", "k_proj", "k"),
+        ("qkv_proj", "v_proj", "v"),
+        ("gate_up_proj", "gate_proj", 0),
+        ("gate_up_proj", "up_proj", 1),
+    ]
+    params_dict = dict(self.named_parameters())
+    loaded_params = set()
+    for name, loaded_weight in weights:
+        for (param_name, shard_name, shard_id) in stacked_params_mapping:
+            if shard_name not in name:
+                continue
+            name = name.replace(shard_name, param_name)
+            # Skip loading extra bias for GPTQ models.
+            if name.endswith(".bias") and name not in params_dict:
+                continue
+            param = params_dict[name]
+            weight_loader = param.weight_loader
+            weight_loader(param, loaded_weight, shard_id)
+            break
+        else:
+            # lm_head is not used in vllm as it is tied with embed_token.
+            # To prevent errors, skip loading lm_head.weight.
+            if "lm_head.weight" in name:
+                continue
+            # Skip loading extra bias for GPTQ models.
+            if name.endswith(".bias") and name not in params_dict:
+                continue
+            # GemmaRMSNorm differs from Llama's in that it scales the output
+            # by (1 + weight) instead of just weight.
+            if "norm.weight" in name:
+                # Out-of-place add, so the actor's weights are not modified in place.
+                norm_weight = loaded_weight + 1.0
+                param = params_dict[name]
+                weight_loader = getattr(param, "weight_loader", default_weight_loader)
+                weight_loader(param, norm_weight)
+            else:
+                param = params_dict[name]
+                weight_loader = getattr(param, "weight_loader", default_weight_loader)
+                weight_loader(param, loaded_weight)
+        loaded_params.add(name)
+    unloaded_params = params_dict.keys() - loaded_params
+    if unloaded_params:
+        raise RuntimeError("Some weights are not initialized from checkpoints: "
+                           f"{unloaded_params}")
+
+
+def load_hf_weights(actor_weights: Dict, vllm_model: nn.Module):
+    assert isinstance(actor_weights, dict)
+    with set_default_torch_dtype(next(vllm_model.parameters()).dtype):  # TODO
+        vllm_model.load_weights(actor_weights.items())
+    for _, module in vllm_model.named_modules():
+        quant_method = getattr(module, "quant_method", None)
+        if quant_method is not None:
+            quant_method.process_weights_after_loading(module)
+        # FIXME: Remove this after Mixtral is updated
+        # to use quant_method.
+        if hasattr(module, "process_weights_after_loading"):
+            module.process_weights_after_loading()
+    vllm_model = vllm_model.cuda()
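The name rewriting and the norm offset in `gemma_load_weights` can be illustrated without torch or vLLM. This is a rough sketch under stated assumptions: `resolve_param_name` and `gemma_norm_values` are hypothetical helpers that do not exist in this commit, plain Python lists stand in for tensors, and the checkpoint names in the usage lines are illustrative examples.

```python
from typing import List, Optional, Tuple, Union

ShardId = Union[str, int]

# Same (param_name, shard_name, shard_id) triples as in gemma_load_weights.
STACKED_PARAMS_MAPPING: List[Tuple[str, str, ShardId]] = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]


def resolve_param_name(name: str) -> Tuple[str, Optional[ShardId]]:
    """Map a checkpoint name to the fused parameter name and its shard id.

    Returns (name, None) when the weight is not part of a fused parameter.
    """
    for param_name, shard_name, shard_id in STACKED_PARAMS_MAPPING:
        if shard_name in name:
            return name.replace(shard_name, param_name), shard_id
    return name, None


def gemma_norm_values(values: List[float]) -> List[float]:
    """Return a new list with 1.0 added; the input list is left untouched."""
    return [v + 1.0 for v in values]


# Illustrative checkpoint names (assumed, not taken from the commit).
for ckpt_name in (
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
):
    print(ckpt_name, "->", resolve_param_name(ckpt_name))
```

Returning a new object for the norm weights mirrors the intent of the `loaded_weight + 1.0` line: the source weights are shared with the actor, so the offset must not be applied in place.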
--- a/verl/third_party/vllm/vllm_v_0_4_2/llm.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/llm.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/llm_engine_sp.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/llm_engine_sp.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/megatron_weight_loaders.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/megatron_weight_loaders.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/model_loader.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/model_loader.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/model_runner.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/model_runner.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/parallel_state.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/parallel_state.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/spmd_gpu_executor.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/spmd_gpu_executor.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/tokenizer.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/tokenizer.py
--- a/verl/third_party/vllm/vllm_v_0_4_2/worker.py
+++ b/verl/third_party/vllm/vllm_v_0_4_2/worker.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/__init__.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/__init__.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/arg_utils.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/arg_utils.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/config.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/config.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/dtensor_weight_loaders.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/dtensor_weight_loaders.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/hf_weight_loader.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/hf_weight_loader.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/llm.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/llm.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/llm_engine_sp.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/llm_engine_sp.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/megatron_weight_loaders.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/megatron_weight_loaders.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/model_loader.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/model_loader.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/model_runner.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/model_runner.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/parallel_state.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/parallel_state.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/spmd_gpu_executor.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/spmd_gpu_executor.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/tokenizer.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/tokenizer.py
--- a/verl/third_party/vllm/vllm_v_0_5_4/worker.py
+++ b/verl/third_party/vllm/vllm_v_0_5_4/worker.py
--- a/verl/trainer/__init__.py
+++ b/verl/trainer/__init__.py
--- a/verl/trainer/config/evaluation.yaml
+++ b/verl/trainer/config/evaluation.yaml
--- a/verl/trainer/config/generation.yaml
+++ b/verl/trainer/config/generation.yaml
--- a/verl/trainer/config/ppo_megatron_trainer.yaml
+++ b/verl/trainer/config/ppo_megatron_trainer.yaml
--- a/verl/trainer/config/ppo_trainer.yaml
+++ b/verl/trainer/config/ppo_trainer.yaml
--- a/verl/trainer/config/sft_trainer.yaml
+++ b/verl/trainer/config/sft_trainer.yaml
--- a/verl/trainer/fsdp_sft_trainer.py
+++ b/verl/trainer/fsdp_sft_trainer.py
--- a/verl/trainer/main_eval.py
+++ b/verl/trainer/main_eval.py
--- a/verl/trainer/main_generation.py
+++ b/verl/trainer/main_generation.py
--- a/verl/trainer/main_ppo.py
+++ b/verl/trainer/main_ppo.py
--- a/verl/trainer/ppo/__init__.py
+++ b/verl/trainer/ppo/__init__.py
--- a/verl/trainer/ppo/actor/__init__.py
+++ b/verl/trainer/ppo/actor/__init__.py
--- a/verl/trainer/ppo/actor/base.py
+++ b/verl/trainer/ppo/actor/base.py
--- a/verl/trainer/ppo/actor/dp_actor.py
+++ b/verl/trainer/ppo/actor/dp_actor.py
--- a/verl/trainer/ppo/actor/megatron_actor.py
+++ b/verl/trainer/ppo/actor/megatron_actor.py
--- a/verl/trainer/ppo/core_algos.py
+++ b/verl/trainer/ppo/core_algos.py
--- a/verl/trainer/ppo/critic/__init__.py
+++ b/verl/trainer/ppo/critic/__init__.py
--- a/verl/trainer/ppo/critic/base.py
+++ b/verl/trainer/ppo/critic/base.py
--- a/verl/trainer/ppo/critic/dp_critic.py
+++ b/verl/trainer/ppo/critic/dp_critic.py
--- a/verl/trainer/ppo/critic/megatron_critic.py
+++ b/verl/trainer/ppo/critic/megatron_critic.py
--- a/verl/trainer/ppo/hybrid_engine/__init__.py
+++ b/verl/trainer/ppo/hybrid_engine/__init__.py
--- a/verl/trainer/ppo/hybrid_engine/base.py
+++ b/verl/trainer/ppo/hybrid_engine/base.py
--- a/verl/trainer/ppo/hybrid_engine/fsdp_vllm.py
+++ b/verl/trainer/ppo/hybrid_engine/fsdp_vllm.py
--- a/verl/trainer/ppo/hybrid_engine/megatron_vllm.py
+++ b/verl/trainer/ppo/hybrid_engine/megatron_vllm.py
--- a/verl/trainer/ppo/ray_trainer.py
+++ b/verl/trainer/ppo/ray_trainer.py
--- a/verl/trainer/ppo/reward_model/__init__.py
+++ b/verl/trainer/ppo/reward_model/__init__.py
--- a/verl/trainer/ppo/reward_model/base.py
+++ b/verl/trainer/ppo/reward_model/base.py
--- a/verl/trainer/ppo/reward_model/megatron/__init__.py
+++ b/verl/trainer/ppo/reward_model/megatron/__init__.py
--- a/verl/trainer/ppo/reward_model/megatron/reward_model.py
+++ b/verl/trainer/ppo/reward_model/megatron/reward_model.py
--- a/verl/trainer/ppo/rollout/__init__.py
+++ b/verl/trainer/ppo/rollout/__init__.py
--- a/verl/trainer/ppo/rollout/base.py
+++ b/verl/trainer/ppo/rollout/base.py
--- a/verl/trainer/ppo/rollout/hf_rollout.py
+++ b/verl/trainer/ppo/rollout/hf_rollout.py
--- a/verl/trainer/ppo/rollout/megatron/__init__.py
+++ b/verl/trainer/ppo/rollout/megatron/__init__.py
--- a/verl/trainer/ppo/rollout/naive/__init__.py
+++ b/verl/trainer/ppo/rollout/naive/__init__.py
--- a/verl/trainer/ppo/rollout/naive/naive_rollout.py
+++ b/verl/trainer/ppo/rollout/naive/naive_rollout.py
--- a/verl/trainer/ppo/rollout/tokenizer.py
+++ b/verl/trainer/ppo/rollout/tokenizer.py
--- a/verl/trainer/ppo/rollout/vllm_rollout/__init__.py
+++ b/verl/trainer/ppo/rollout/vllm_rollout/__init__.py
--- a/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py
+++ b/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py
--- a/verl/trainer/ppo/workers/__init__.py
+++ b/verl/trainer/ppo/workers/__init__.py
--- a/verl/trainer/ppo/workers/fsdp_workers.py
+++ b/verl/trainer/ppo/workers/fsdp_workers.py
--- a/verl/trainer/ppo/workers/megatron_workers.py
+++ b/verl/trainer/ppo/workers/megatron_workers.py
--- a/verl/trainer/runtime_env.yaml
+++ b/verl/trainer/runtime_env.yaml
--- a/verl/utils/__init__.py
+++ b/verl/utils/__init__.py
--- a/verl/utils/config.py
+++ b/verl/utils/config.py
--- a/verl/utils/dataset/README.md
+++ b/verl/utils/dataset/README.md
--- a/verl/utils/dataset/__init__.py
+++ b/verl/utils/dataset/__init__.py
--- a/verl/utils/dataset/rl_dataset.py
+++ b/verl/utils/dataset/rl_dataset.py
--- a/verl/utils/dataset/rm_dataset.py
+++ b/verl/utils/dataset/rm_dataset.py
--- a/verl/utils/dataset/sft_dataset.py
+++ b/verl/utils/dataset/sft_dataset.py
--- a/verl/utils/debug/__init__.py
+++ b/verl/utils/debug/__init__.py
--- a/verl/utils/debug/performance.py
+++ b/verl/utils/debug/performance.py
--- a/verl/utils/debug/trajectory_tracker.py
+++ b/verl/utils/debug/trajectory_tracker.py
--- a/verl/utils/distributed.py
+++ b/verl/utils/distributed.py
--- a/verl/utils/fs.py
+++ b/verl/utils/fs.py
--- a/verl/utils/fsdp_utils.py
+++ b/verl/utils/fsdp_utils.py
--- a/verl/utils/hdfs_io.py
+++ b/verl/utils/hdfs_io.py
--- a/verl/utils/import_utils.py
+++ b/verl/utils/import_utils.py
--- a/verl/utils/logger/__init__.py
+++ b/verl/utils/logger/__init__.py
--- a/verl/utils/logger/aggregate_logger.py
+++ b/verl/utils/logger/aggregate_logger.py
--- a/verl/utils/logging_utils.py
+++ b/verl/utils/logging_utils.py
--- a/verl/utils/megatron/__init__.py
+++ b/verl/utils/megatron/__init__.py
--- a/verl/utils/megatron/memory.py
+++ b/verl/utils/megatron/memory.py
--- a/verl/utils/megatron/optimizer.py
+++ b/verl/utils/megatron/optimizer.py
--- a/verl/utils/megatron/optimizer_config.py
+++ b/verl/utils/megatron/optimizer_config.py
--- a/verl/utils/megatron/pipeline_parallel.py
+++ b/verl/utils/megatron/pipeline_parallel.py
--- a/verl/utils/megatron/sequence_parallel.py
+++ b/verl/utils/megatron/sequence_parallel.py
--- a/verl/utils/megatron/tensor_parallel.py
+++ b/verl/utils/megatron/tensor_parallel.py
--- a/verl/utils/megatron_utils.py
+++ b/verl/utils/megatron_utils.py
--- a/verl/utils/memory_buffer.py
+++ b/verl/utils/memory_buffer.py
--- a/verl/utils/model.py
+++ b/verl/utils/model.py
--- a/verl/utils/py_functional.py
+++ b/verl/utils/py_functional.py
--- a/verl/utils/ray_utils.py
+++ b/verl/utils/ray_utils.py
--- a/verl/utils/rendezvous/__init__.py
+++ b/verl/utils/rendezvous/__init__.py
--- a/verl/utils/rendezvous/ray_backend.py
+++ b/verl/utils/rendezvous/ray_backend.py
--- a/verl/utils/reward_score/__init__.py
+++ b/verl/utils/reward_score/__init__.py
--- a/verl/utils/reward_score/gsm8k.py
+++ b/verl/utils/reward_score/gsm8k.py
--- a/verl/utils/reward_score/math.py
+++ b/verl/utils/reward_score/math.py
--- a/verl/utils/torch_dtypes.py
+++ b/verl/utils/torch_dtypes.py
--- a/verl/utils/torch_functional.py
+++ b/verl/utils/torch_functional.py
--- a/verl/utils/tracking.py
+++ b/verl/utils/tracking.py
--- a/verl/version/version
+++ b/verl/version/version