Unverified commit 41b7c583 by Franz Srambical, committed by GitHub

fix: typo (explanation) (#167)

parent e7fd415a
.. _config-explain-page:
-Config Explaination
+Config Explanation
===================
ppo_trainer.yaml for FSDP Backend
@@ -101,7 +101,7 @@ Step 4: Perform PPO training with your model on GSM8K Dataset
- Users could replace the ``data.train_files`` ,\ ``data.val_files``,
``actor_rollout_ref.model.path`` and ``critic.model.path`` based on
their environment.
-- See :doc:`config` for detailed explaination of each config field.
+- See :doc:`config` for detailed explanation of each config field.
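For orientation, a minimal sketch of how these fields might be overridden on the command line; the ``verl.trainer.main_ppo`` entry point is an assumption, and the dataset and model paths below are placeholders::

    # Override data and model locations for your environment (paths are placeholders).
    python3 -m verl.trainer.main_ppo \
        data.train_files=$HOME/data/gsm8k/train.parquet \
        data.val_files=$HOME/data/gsm8k/test.parquet \
        actor_rollout_ref.model.path=path/to/actor_and_ref_model \
        critic.model.path=path/to/critic_model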
**Reward Model/Function**
@@ -136,7 +136,7 @@ If you encounter out of memory issues with HBM less than 32GB, enable the following
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
critic.ppo_micro_batch_size_per_gpu=1 \
-For the full set of configs, please refer to :ref:`config-explain-page` for detailed explaination and performance tuning.
+For the full set of configs, please refer to :ref:`config-explain-page` for detailed explanation and performance tuning.
.. [1] The original paper (https://arxiv.org/pdf/2110.14168) mainly focuses on training a verifier (a reward model) to solve math problems via Best-of-N sampling. In this example, we train an RL agent using a rule-based reward model.
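A hedged sketch of how the memory-saving overrides above combine with such a launch command (same assumed ``verl.trainer.main_ppo`` entry point, placeholder paths)::

    # Micro batch size 1 per GPU lowers peak activation memory;
    # each PPO mini-batch is then processed in more gradient-accumulation steps.
    python3 -m verl.trainer.main_ppo \
        data.train_files=$HOME/data/gsm8k/train.parquet \
        data.val_files=$HOME/data/gsm8k/test.parquet \
        actor_rollout_ref.model.path=path/to/actor_and_ref_model \
        critic.model.path=path/to/critic_model \
        actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
        critic.ppo_micro_batch_size_per_gpu=1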