- Reinforcement learning from human feedback with [PPO](https://github.com/volcengine/verl/tree/main/examples/ppo_trainer) and [GRPO](https://github.com/volcengine/verl/tree/main/examples/grpo_trainer)
- Reinforcement learning from human feedback with [PPO](https://github.com/volcengine/verl/tree/main/examples/ppo_trainer) and [GRPO](https://github.com/volcengine/verl/tree/main/examples/grpo_trainer)
- Support model-based reward and function-based reward (verifiable reward)
- Support model-based reward and function-based reward (verifiable reward)
- flash-attention integration, sequence packing, and long context support via DeepSpeed Ulysses
- flash-attention, [sequence packing](examples/ppo_trainer/run_qwen2-7b_seq_balance.sh), [long context](examples/ppo_trainer/run_deepseek7b_llm_sp2.sh) support via DeepSpeed Ulysses, [LoRA](examples/sft/gsm8k/run_qwen_05_peft.sh), [Liger-kernel](examples/sft/gsm8k/run_qwen_05_sp2_liger.sh)
- scales up to 70B models and hundreds of GPUs
- scales up to 70B models and hundreds of GPUs
- experiment tracking with wandb and mlflow
- experiment tracking with wandb and mlflow
...
@@ -60,7 +60,7 @@ Checkout this [Jupyter Notebook](https://github.com/volcengine/verl/tree/main/ex
...
@@ -60,7 +60,7 @@ Checkout this [Jupyter Notebook](https://github.com/volcengine/verl/tree/main/ex
**Running a PPO example step-by-step:**
**Running a PPO example step-by-step:**
- Data and Reward Preparation
- Data and Reward Preparation
-[Prepare Data (Parquet) for Post-Training](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)
-[Prepare Data for Post-Training](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)
-[Implement Reward Function for Dataset](https://verl.readthedocs.io/en/latest/preparation/reward_function.html)
-[Implement Reward Function for Dataset](https://verl.readthedocs.io/en/latest/preparation/reward_function.html)
- Understanding the PPO Example
- Understanding the PPO Example
-[PPO Example Architecture](https://verl.readthedocs.io/en/latest/examples/ppo_code_architecture.html)
-[PPO Example Architecture](https://verl.readthedocs.io/en/latest/examples/ppo_code_architecture.html)