Commit 1e69b079 authored Oct 18, 2024 by nzy
readme: record sft orm's experiments
parent d631895d
Showing 3 changed files with 40 additions and 2 deletions
.gitignore  +5 -2
readme.qmd  +28 -0
refs.bib    +7 -0
.gitignore
@@ -161,4 +161,7 @@ cython_debug/
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
readme.pdf
\ No newline at end of file
readme.pdf
*.json
*.jsonl
test_*
\ No newline at end of file
readme.qmd
@@ -52,6 +52,34 @@ template: deepseekcoder
stage: rm
```
### Additional Experiments
We want to see whether different loss functions affect model performance.
The Process Reward Model (PRM) and the Critic Model use the SFT loss, which is essentially cross-entropy.
The Outcome Reward Model (ORM) uses a reward loss.
For details, see the loss implementations in [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/models/loss.py).
Our main question is whether these two loss functions give different results.
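To make the comparison concrete, here is a minimal sketch of the two objectives. It assumes the usual token-level cross-entropy for the SFT-trained models and a Bradley-Terry pairwise loss for the orm, similar in spirit to OpenRLHF's `PairWiseLoss`; the function names and tensor shapes are illustrative, not our exact training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """SFT objective: token-level cross-entropy over the target sequence.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len),
    with ignored positions (e.g. the prompt) set to -100.
    """
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )

def pairwise_reward_loss(chosen_reward: torch.Tensor,
                         rejected_reward: torch.Tensor) -> torch.Tensor:
    """Reward objective (Bradley-Terry style): push the scalar score of the
    chosen (correct) solution above the rejected (incorrect) one.

    chosen_reward, rejected_reward: (batch,) scores from the reward head.
    """
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```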
To find out, we train a new model, called SFT orm.
It is trained on the same dataset but with the SFT loss, aiming to match the performance of the standard reward model (orm).
First, we use the hyperparameters from the llamafactory examples and set the number of epochs to 1, the same as the orm.
The results are poor: the SFT orm is only slightly better than random, far from the orm's performance.
Looking at [@lightman2023let], we see that the PRM needs more epochs to train well,
so we train the SFT orm for 3 epochs. It improves but still does not match the orm.
This makes us think the SFT loss may be less data-efficient,
and we suspect the SFT orm simply needs more data.
This aligns with the note in [@lightman2023let] that a second epoch improves performance on smaller datasets,
while more epochs do not help much beyond that, especially on larger datasets.
| model             | interview | competition | introductory |
| :---------------: | :-------: | :---------: | :----------: |
| random            | 21.4%     | 8.7%        | 34.4%        |
| sft orm (epoch=3) | 36.5%     | 27.2%       | 42.3%        |
| orm               | 53.8%     | 27.2%       | 50.0%        |
## Environment
Same as Llama-factory (recommended version)
refs.bib
@@ -23,4 +23,10 @@
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2409.06957},
}
@article{lightman2023let,
title={Let's verify step by step},
author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
journal={arXiv preprint arXiv:2305.20050},
year={2023}
}
\ No newline at end of file