> We propose the **D**ecoupled Clip and Dynamic s**A**mpling **P**olicy **O**ptimization (DAPO) algorithm. By making our work publicly available, we provide the broader research community and society with practical access to scalable reinforcement learning, enabling all to benefit from these advancements. Our system is based on the awesome [verl](https://github.com/volcengine/verl) framework. Thanks for their great work! Training the Qwen2.5-32B base model with DAPO outperforms the previous state-of-the-art DeepSeek-R1-Zero-Qwen-32B on AIME 2024, achieving **50%** accuracy with **50%** fewer training steps.
> DAPO samples a group of outputs $\left\{o_i\right\}_{i=1}^G$ for each question $q$ paired with the answer $a$, and optimizes the policy via the following objective:
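>
> $$\mathcal{J}_{\text{DAPO}}(\theta)=\mathbb{E}_{(q,a)\sim\mathcal{D},\ \{o_i\}_{i=1}^G\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)}\left[\frac{1}{\sum_{i=1}^G|o_i|}\sum_{i=1}^G\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\text{low}},\,1+\varepsilon_{\text{high}}\big)\,\hat{A}_{i,t}\Big)\right]$$
>
> $$\text{s.t.}\quad 0<\big|\{o_i\mid\texttt{is\_equivalent}(a,o_i)\}\big|<G,$$
>
> where $r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})}$ is the token-level importance ratio and $\hat{A}_{i,t}=\frac{R_i-\operatorname{mean}(\{R_i\}_{i=1}^G)}{\operatorname{std}(\{R_i\}_{i=1}^G)}$ is the group-normalized advantage.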
### Dynamic Sampling

Setting `filter_groups.enable` to `True` will filter out groups whose outputs' `metric` values are all identical, e.g., for `acc`, groups whose outputs' accuracies are all 1 or all 0.

Setting `fill_to_train_bsz` to `True` will repeat sampling with `gen_batch_size` until there are enough qualified groups to fill `train_batch_size`.

Setting `drop_last_mini_batch` to `True` might be helpful when `fill_to_train_bsz` is `False`, since the last mini-batch might be incomplete due to possibly strong filtering.
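To make the filtering and resampling behavior concrete, here is a minimal Python sketch under illustrative assumptions: `rollout` and `metric_fn` are hypothetical callables standing in for the actual rollout engine and metric computation, and the control flow only mirrors the settings described above, not the verl implementation.

```python
from typing import Callable, List, Sequence, Tuple

def collect_train_batch(
    prompts: Sequence[str],
    rollout: Callable[[str, int], List[str]],   # hypothetical: returns a group of G outputs for a prompt
    metric_fn: Callable[[str, str], float],     # hypothetical: e.g. accuracy in {0.0, 1.0}
    group_size: int,
    train_batch_size: int,
    fill_to_train_bsz: bool = True,
) -> List[Tuple[str, List[str], List[float]]]:
    """Collect qualified groups for one training batch.

    A group qualifies only if its outputs' metrics are not all identical
    (filter_groups.enable). With fill_to_train_bsz, generation passes are
    repeated until train_batch_size qualified groups have been collected."""
    qualified: List[Tuple[str, List[str], List[float]]] = []
    while len(qualified) < train_batch_size:
        progressed = False
        # One generation pass over a batch of prompts (roughly gen_batch_size);
        # in practice a fresh batch of prompts would be drawn for each pass.
        for q in prompts:
            outputs = rollout(q, group_size)
            metrics = [metric_fn(q, o) for o in outputs]
            if len(set(metrics)) > 1:            # drop all-correct / all-wrong groups
                qualified.append((q, outputs, metrics))
                progressed = True
                if len(qualified) == train_batch_size:
                    return qualified
        if not fill_to_train_bsz or not progressed:
            break                                # accept a possibly short batch instead of looping forever
    return qualified
```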
### Token-level Policy Gradient Loss
An example configuration:
```yaml
actor_rollout_ref:
  actor:
    use_token_level_loss: True
```
Setting `use_token_level_loss` to `True` will mean (i.e., average) the policy gradient loss across all the tokens of all the sequences in a batch, instead of first averaging within each sequence and then across sequences, so longer responses contribute proportionally more tokens to the gradient.
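As a concrete illustration, here is a minimal PyTorch-style sketch of the two aggregation modes; the function name, tensor shapes, and `response_mask` argument are assumptions for illustration, not the verl code.

```python
import torch

def aggregate_pg_loss(per_token_loss: torch.Tensor,
                      response_mask: torch.Tensor,
                      use_token_level_loss: bool = True) -> torch.Tensor:
    """per_token_loss and response_mask have shape [batch, seq_len];
    response_mask is 1.0 on response tokens and 0.0 elsewhere."""
    if use_token_level_loss:
        # Token-level: every response token in the batch carries equal weight,
        # so longer responses contribute more terms to the objective.
        return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1.0)
    # Sequence-level: average within each sequence first, then across sequences,
    # so every response carries equal weight regardless of its length.
    per_seq = (per_token_loss * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1.0)
    return per_seq.mean()
```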
### Overlong Reward Shaping

Setting `overlong_buffer.enable` to `True` will penalize outputs whose lengths fall within the last `overlong_buffer.len` tokens before `max_response_length`.
The penalty increases linearly from 0 to `overlong_buffer.penalty_factor` as the output length approaches `max_response_length`.
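A minimal sketch of such a linear length penalty is below; the function and argument names are hypothetical and only mirror the settings described above, not the exact reward code in verl.

```python
def overlong_penalty(response_length: int,
                     max_response_length: int,
                     overlong_buffer_len: int,
                     penalty_factor: float) -> float:
    """Non-positive reward adjustment for overly long responses.

    The penalty is 0 until the response enters the last `overlong_buffer_len`
    tokens before `max_response_length`, then grows linearly in magnitude
    up to `penalty_factor` at the length cap."""
    threshold = max_response_length - overlong_buffer_len
    overflow = min(response_length, max_response_length) - threshold
    if overflow <= 0:
        return 0.0
    return -penalty_factor * overflow / overlong_buffer_len
```

For example, with `max_response_length = 16384`, `overlong_buffer.len = 4096`, and `overlong_buffer.penalty_factor = 1.0`, a 14336-token response would receive a `-0.5` adjustment to its reward under this sketch.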