Files · 872022d0603112c4b76eac2d4305c333220dd361 · ZhangXiaoyun / verl

[perf] fix: correct meta weight init error to support hsdp (#508) · 872022d0

Current bugs when HSDP is enabled:
- **Incorrect Division in Batch Sizes**
  - `ppo_micro_batch`, `ppo_minibatch`, etc. should be divided by
    `self.device_mesh.size()` instead of `self.device_mesh.shape[0]`.
- **Improper Weight Initialization** in `get_init_weight_context_manager`
  - The `get_init_weight_context_manager` function must load real weights
    only on `local_rank == 0` within every fsdp mesh; every other rank
    gets empty (meta) weights.
  - When `sync_module_states=True`, PyTorch's FSDP first broadcasts
    parameters within the fsdp process group and then within the ddp process
    group. If weights are not initialized correctly on `local_rank == 0` of
    each fsdp mesh, the synchronization process may fail or produce
    incorrect results.
    https://github.com/pytorch/pytorch/blob/3f069e7679588d5ee4b1d5b2492ca0e20f9320b5/torch/distributed/fsdp/_init_utils.py#L614-L621
  - Ensure real initialization occurs only when
    `self.device_mesh.get_coordinate()[-1] == 0`, which corresponds to
    `local_rank == 0` within each fsdp mesh.
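
Both fixes can be illustrated without a GPU cluster. The following is a minimal pure-Python sketch, not the code from this commit: the helpers `mesh_coordinate`, `per_rank_batch_size`, and `needs_real_weights` are hypothetical names, and the sketch assumes a 2-D mesh laid out in row-major order as `(ddp, fsdp)`, so the last coordinate plays the role of `self.device_mesh.get_coordinate()[-1]` and the product of the shape plays the role of `self.device_mesh.size()`.

```python
from math import prod


def mesh_coordinate(rank, mesh_shape):
    """Return the coordinate of `rank` in a row-major mesh of shape `mesh_shape`."""
    coord = []
    for dim in reversed(mesh_shape):
        coord.append(rank % dim)
        rank //= dim
    return tuple(reversed(coord))


def per_rank_batch_size(global_batch_size, mesh_shape):
    """Divide by the total number of ranks (the analogue of device_mesh.size())."""
    world_size = prod(mesh_shape)
    if global_batch_size % world_size != 0:
        raise ValueError("batch size must be divisible by the mesh size")
    return global_batch_size // world_size


def needs_real_weights(rank, mesh_shape):
    """True when the rank is local rank 0 of its fsdp group (last coordinate is 0)."""
    return mesh_coordinate(rank, mesh_shape)[-1] == 0


# Assumed HSDP layout for illustration: 2 ddp replicas x 4 fsdp shards.
mesh_shape = (2, 4)

print(per_rank_batch_size(256, mesh_shape))  # 32: divided by all 8 ranks
print(256 // mesh_shape[0])                  # 128: the buggy division by shape[0]
print([r for r in range(prod(mesh_shape)) if needs_real_weights(r, mesh_shape)])  # [0, 4]
```

With this layout, ranks 0 and 4 are the first rank of their respective fsdp groups, so each group has a source holding real weights for the intra-group broadcast. Checking only the global rank would leave the second group (ranks 4 to 7) with no initialized source.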

committed Mar 10, 2025


Name
.github
docker
docs
examples
patches
scripts
tests
verl
.gitignore
.readthedocs.yaml
.style.yapf
LICENSE
Notice.txt
README.md
pyproject.toml
requirements.txt
setup.py

README.md