Commit 1e69b079 by nzy

readme: record sft orm's experiments

parent d631895d
@@ -161,4 +161,7 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
readme.pdf
\ No newline at end of file
readme.pdf
*.json
*.jsonl
test_*
\ No newline at end of file
@@ -52,6 +52,34 @@ template: deepseekcoder
stage: rm
```
### Additional Experiments
We want to see whether the choice of loss function affects model performance.
The Process Reward Model (PRM) and the Critic Model are trained with the SFT loss, which is essentially token-level cross-entropy.
The Outcome Reward Model (orm) is trained with a pairwise reward loss instead.
For details, see ***[OpenRLHF](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/models/loss.py)***.
Our main question is whether these two loss functions give different results.
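
For concreteness, here is a minimal sketch of the two objectives in Python. The function names are ours, and the pairwise form only mirrors the shape of OpenRLHF's `PairWiseLoss`; see its `loss.py` for the exact implementation.

```python
import torch.nn.functional as F


def sft_loss(logits, labels):
    """Token-level cross-entropy: the SFT loss used for the PRM / Critic.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len), with
    -100 marking positions that should not contribute to the loss.
    """
    # Shift so that position t predicts token t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )


def pairwise_reward_loss(chosen_reward, rejected_reward):
    """Bradley-Terry pairwise loss: the reward loss used for the orm.

    chosen_reward / rejected_reward: scalar rewards of shape (batch,).
    Pushes the reward of the chosen answer above the rejected one.
    """
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Note the asymmetry: the SFT loss supervises every token, while the reward loss only constrains the relative order of two scalar scores per pair.
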
To find out, we create a new model called the SFT orm.
It is trained on the same dataset as the standard outcome reward model (orm), but with the SFT loss, and the goal is to match the orm's performance.
First, we use the hyperparameters from the LLaMA-Factory examples and train for 1 epoch, the same as the orm.
The results are poor: the SFT orm is only slightly better than random, far from the orm's performance.
Looking at [@lightman2023let], we see that a PRM needs more epochs to train well, so we train the SFT orm for 3 epochs.
It improves, but still does not match the orm.
This makes us think the SFT loss is less sample-efficient here, and that the SFT orm simply needs more data.
This aligns with [@lightman2023let]'s note that a second epoch improves performance on smaller datasets, while more epochs stop helping beyond a point, especially on larger ones.
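
A reward model is typically used to rank candidate solutions. How each of these two models turns a solution into a score is not spelled out above, so the following is only a hedged sketch under common conventions: the scalar-head interface for the orm and the `Yes` verdict token for the SFT orm are both assumptions.

```python
import torch


@torch.no_grad()
def score_with_orm(reward_model, input_ids):
    # orm: the model's value head returns one scalar reward per sequence
    # (assumed interface).
    return reward_model(input_ids).item()


@torch.no_grad()
def score_with_sft_orm(lm, tokenizer, input_ids, positive_token="Yes"):
    # SFT orm (assumed setup): the model is fine-tuned to emit a verdict
    # token, so we rank candidates by P(positive_token | solution).
    logits = lm(input_ids).logits[0, -1]  # next-token logits
    positive_id = tokenizer.convert_tokens_to_ids(positive_token)
    return torch.softmax(logits, dim=-1)[positive_id].item()
```

Either scorer can then be used to pick the best of n candidate solutions for a problem.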

| model             | interview | competition | introductory |
| :---------------: | :-------: | :---------: | :----------: |
| random            | 21.4%     | 8.7%        | 34.4%        |
| SFT orm (epoch=3) | 36.5%     | 27.2%       | 42.3%        |
| orm               | 53.8%     | 27.2%       | 50.0%        |
## Environment
Same as LLaMA-Factory (recommended version).
@@ -23,4 +23,10 @@
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2409.06957},
}
@misc{lightman2023let,
title={Let's Verify Step by Step},
author={Hunter Lightman and Vineet Kosaraju and Yura Burda and Harri Edwards and Bowen Baker and Teddy Lee and Jan Leike and John Schulman and Ilya Sutskever and Karl Cobbe},
year={2023},
eprint={2305.20050},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2305.20050},
}
\ No newline at end of file