ZhangXiaoyun / verl

Commit e230de86 (unverified), authored Jan 14, 2025 by hoshi-hiyouga, committed by GitHub on Jan 14, 2025:

    Fix loss value for gradient accumulation > 1 (#102)

parent 1facb9d2

Showing 1 changed file with 5 additions and 2 deletions:
verl/trainer/fsdp_sft_trainer.py (+5, -2)
@@ -256,9 +256,11 @@ class FSDPSFTTrainer(object):
        micro_batches = batch.split(self.config.data.micro_batch_size)
        n_micro_batches = len(micro_batches)
        step_loss = 0
        for micro_batch in micro_batches:
            loss = self._compute_loss(batch=micro_batch) / n_micro_batches
            loss.backward()
            step_loss += loss.item()
        self.fsdp_model.clip_grad_norm_(max_norm=self.config.optim.clip_grad)

@@ -275,8 +277,9 @@ class FSDPSFTTrainer(object):
            log_gpu_memory_usage('After offload weights', logger=logger)
-        # TODO: all reduce to get accurate loss
-        return {'train/loss': loss.detach().item(), 'train/lr(1e-3)': lr * 1e3}
+        step_loss = torch.tensor(step_loss).cuda()
+        torch.distributed.all_reduce(step_loss, op=torch.distributed.ReduceOp.AVG)
+        return {'train/loss': step_loss.detach().item(), 'train/lr(1e-3)': lr * 1e3}

    def validation_step(self, batch: TensorDict):
        self.fsdp_model.eval()
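The effect of the change can be illustrated without torch. The sketch below (plain Python, hypothetical loss values) mirrors the two parts of the commit: each micro-batch loss is divided by `n_micro_batches` so the accumulated gradient and the reported `step_loss` match a single full-batch step, and the per-rank `step_loss` is then averaged across data-parallel ranks, which the commit does with `torch.distributed.all_reduce(..., op=ReduceOp.AVG)`.

```python
def accumulate_step_loss(micro_batch_losses):
    """Mirror of the training loop: scale each micro-batch loss by 1/n, sum.

    Matches: loss = self._compute_loss(batch=micro_batch) / n_micro_batches
             step_loss += loss.item()
    """
    n = len(micro_batch_losses)
    step_loss = 0.0
    for loss in micro_batch_losses:
        scaled = loss / n
        # scaled.backward() would run here; gradients sum over micro-batches,
        # so the 1/n factor makes the total gradient a full-batch average.
        step_loss += scaled
    return step_loss


def all_reduce_avg(per_rank_losses):
    """Plain-Python stand-in for torch.distributed.all_reduce with ReduceOp.AVG."""
    return sum(per_rank_losses) / len(per_rank_losses)


# Hypothetical per-micro-batch losses on two data-parallel ranks.
rank0 = accumulate_step_loss([1.0, 2.0, 3.0, 6.0])  # 3.0, the mean micro-batch loss
rank1 = accumulate_step_loss([2.0, 4.0, 6.0, 4.0])  # 4.0
print(all_reduce_avg([rank0, rank1]))               # 3.5, logged as 'train/loss'
```

Before the fix, `'train/loss'` reported only the last micro-batch's (already scaled) loss on the local rank; after it, the logged value is the mean loss over the whole step and over all ranks.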