Unverified commit b14299c8 by Zefan Wang, committed by GitHub

update README.md (#534)

1. Add [PRIME](https://arxiv.org/abs/2502.01456) to README.md.
2. Slightly adjust the example script to align with the paper (lower the actor learning rate and the per-GPU PPO micro batch size).
parent f0e7f9fc
@@ -44,7 +44,7 @@ verl is fast with:
 - **vLLM** and **HF Transformers** for rollout generation, **SGLang** support coming soon.
 - Compatible with Hugging Face Transformers and Modelscope Hub.
 - Supervised fine-tuning.
-- Reinforcement learning with [PPO](examples/ppo_trainer/), [GRPO](examples/grpo_trainer/), [ReMax](examples/remax_trainer/), [Reinforce++](https://verl.readthedocs.io/en/latest/examples/config.html#algorithm), [RLOO](examples/rloo_trainer/), etc.
+- Reinforcement learning with [PPO](examples/ppo_trainer/), [GRPO](examples/grpo_trainer/), [ReMax](examples/remax_trainer/), [Reinforce++](https://verl.readthedocs.io/en/latest/examples/config.html#algorithm), [RLOO](examples/rloo_trainer/), [PRIME](recipe/prime/), etc.
 - Support model-based reward and function-based reward (verifiable reward)
 - Support vision-language models (VLMs) and [multi-modal RL](examples/grpo_trainer/run_qwen2_5_vl-7b.sh)
 - Flash attention 2, [sequence packing](examples/ppo_trainer/run_qwen2-7b_seq_balance.sh), [sequence parallelism](examples/ppo_trainer/run_deepseek7b_llm_sp2.sh) support via DeepSpeed Ulysses, [LoRA](examples/sft/gsm8k/run_qwen_05_peft.sh), [Liger-kernel](examples/sft/gsm8k/run_qwen_05_sp2_liger.sh).
...
@@ -22,10 +22,10 @@ python3 -m recipe.prime.main_prime \
     data.accuracy_upper_bound=0.8 \
     data.oversample_factor=4 \
     actor_rollout_ref.model.path=$model_path \
-    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.actor.optim.lr=5e-7 \
     actor_rollout_ref.model.use_remove_padding=True \
     actor_rollout_ref.actor.ppo_mini_batch_size=64 \
-    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
     actor_rollout_ref.model.enable_gradient_checkpointing=True \
     actor_rollout_ref.actor.fsdp_config.param_offload=True \
     actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
...
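
For context, the hunk above moves the PRIME example toward the paper's settings: actor learning rate 1e-6 → 5e-7 and per-GPU PPO micro batch size 8 → 1. The sketch below assembles those updated values into a minimal launch command; it is illustrative only. The model path is a placeholder, and the arguments truncated from this diff (data files, rollout and trainer settings) still need to be supplied from the full script in recipe/prime/.

```bash
# Sketch only: launch the PRIME recipe with the hyperparameters changed in
# this commit. $model_path is a placeholder, and arguments elided by the
# diff (data files, rollout/trainer config) are omitted here.
set -euo pipefail

model_path=Qwen/Qwen2.5-Math-7B-Instruct   # placeholder base model, not from this commit

python3 -m recipe.prime.main_prime \
    data.accuracy_upper_bound=0.8 \
    data.oversample_factor=4 \
    actor_rollout_ref.model.path=$model_path \
    actor_rollout_ref.actor.optim.lr=5e-7 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
```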