Commit 1e69b079 authored Oct 18, 2024 by nzy
readme: record sft orm's experiments
parent d631895d
Showing 3 changed files with 40 additions and 2 deletions
.gitignore  +5 -2
readme.qmd  +28 -0
refs.bib    +7 -0
.gitignore
@@ -161,4 +161,7 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
readme.pdf
\ No newline at end of file
readme.pdf
*.json
*.jsonl
test_*
\ No newline at end of file
readme.qmd
@@ -52,6 +52,34 @@ template: deepseekcoder
stage: rm
```
### Additional Experiments
We want to see whether different loss functions affect model performance.
The Process Reward Model (PRM) and the Critic Model use the SFT loss, which is essentially cross-entropy.
The Outcome Reward Model (ORM) uses a reward loss.
For details, see the loss implementations in [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/models/loss.py).
Our main question is whether these two loss functions give different results.
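To make the comparison concrete, here is a minimal sketch of the two objectives. It assumes the usual token-level cross-entropy for the SFT-trained models and a Bradley-Terry pairwise loss for the orm, similar in spirit to OpenRLHF's `PairWiseLoss`; the function names and tensor shapes are illustrative, not our exact training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """SFT objective: token-level cross-entropy over the target sequence.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len),
    with ignored positions (e.g. the prompt) set to -100.
    """
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )

def pairwise_reward_loss(chosen_reward: torch.Tensor,
                         rejected_reward: torch.Tensor) -> torch.Tensor:
    """Reward objective (Bradley-Terry style): push the scalar score of the
    chosen (correct) solution above the rejected (incorrect) one.

    chosen_reward, rejected_reward: (batch,) scores from the reward head.
    """
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```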
To find out, we train a new model, called SFT orm.
It is trained on the same dataset but with the SFT loss, aiming to match the performance of the standard reward model (orm).
First, we use the hyperparameters from the llamafactory examples and set the number of epochs to 1, the same as the orm.
The results are poor: the SFT orm is only slightly better than random, far from the orm's performance.
Looking at [@lightman2023let], we see that the PRM needs more epochs to train well,
so we train the SFT orm for 3 epochs. It improves but still does not match the orm.
This makes us think the SFT loss may be less data-efficient,
and we suspect the SFT orm simply needs more data.
This aligns with the note in [@lightman2023let] that a second epoch improves performance on smaller datasets,
while more epochs do not help much beyond that, especially on larger datasets.
| model             | interview | competition | introductory |
| :---------------: | :-------: | :---------: | :----------: |
| random            | 21.4%     | 8.7%        | 34.4%        |
| sft orm (epoch=3) | 36.5%     | 27.2%       | 42.3%        |
| orm               | 53.8%     | 27.2%       | 50.0%        |
## Environment
Same as Llama-factory (recommended version)
refs.bib
@@ -23,4 +23,10 @@
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2409.06957},
}
@article{lightman2023let,
title={Let's verify step by step},
author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
journal={arXiv preprint arXiv:2305.20050},
year={2023}
}
\ No newline at end of file