Files · 830fab2aedb7e4cd4fac18e7d1e4830620ee2f1d · ZhangXiaoyun / verl

[bugfix] Fix position embedding processing for Qwen2.5-VL (#527) · 830fab2a

`RLHFDataset.__getitem__` had a bug in how it decides whether to use
multimodal position IDs (3D in Qwen2.5-VL). Previously, the code checked
`self.image_key in row_dict` to make that decision. However,
`self.image_key` is popped from `row_dict` during image token expansion,
so by the time the check runs the key is gone and the check evaluates to
false even for samples that contain images.

This causes the VL model to use incorrect position IDs, resulting in
significant performance degradation.

<img width="349" alt="image"
src="https://github.com/user-attachments/assets/79790bbf-239e-4667-a2c5-d63d91d63165"
/>


The fix introduces an explicit `is_multi_modal` flag to properly track
multimodal content throughout the processing pipeline.
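
The failure mode and the fix can be illustrated with a minimal, self-contained
sketch. This is not the actual `RLHFDataset` code: `IMAGE_KEY`,
`expand_image_tokens`, and both `position_id_kind_*` functions are hypothetical
stand-ins, and the strings `"3d"` / `"1d"` only mark which position ID path
would be taken.

```python
IMAGE_KEY = "images"  # hypothetical stand-in for self.image_key


def expand_image_tokens(row_dict):
    """Stand-in for image token expansion: pops the image key from the row."""
    row_dict["multi_modal_data"] = {"image": row_dict.pop(IMAGE_KEY)}


def position_id_kind_buggy(row_dict):
    if IMAGE_KEY in row_dict:
        expand_image_tokens(row_dict)
    # Bug: the key was popped above, so this membership test is always False.
    return "3d" if IMAGE_KEY in row_dict else "1d"


def position_id_kind_fixed(row_dict):
    # Record the decision once, before the key is popped.
    is_multi_modal = IMAGE_KEY in row_dict
    if is_multi_modal:
        expand_image_tokens(row_dict)
    return "3d" if is_multi_modal else "1d"


print(position_id_kind_buggy({"prompt": "hi", IMAGE_KEY: ["img0"]}))  # 1d
print(position_id_kind_fixed({"prompt": "hi", IMAGE_KEY: ["img0"]}))  # 3d
```

Capturing the flag before any mutation keeps the later branch independent of
what earlier steps remove from `row_dict`.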

Co-authored-by: songyifan <songyifan3@xiaomi.com>

committed Mar 10, 2025

Name
.github
docker
docs
examples
patches
scripts
tests
verl
.gitignore
.readthedocs.yaml
.style.yapf
LICENSE
Notice.txt
README.md
pyproject.toml
requirements.txt
setup.py

README.md