Commit 39e79006 by Yaoyu Zhu

update gitignore

parent 646a6d3d
WARNING: Did not unuse /usr/share/Modules/modulefiles
No Modulefiles Currently Loaded.
Currently Loaded Modulefiles:
1) cluster-tools/v1.0 3) gcc/9.3.0
2) slurm-tools/v1.0 4) cuda-cudnn/11.8-8.8.1
/usr/bin/which: no python in (/tools/cluster-software/cuda-cudnn/cuda-11.8.0-8.8.1/bin:/tools/cluster-software/gcc/gcc-9.3.0/bin:/tools/cluster-software/slurm-tools/slurm-tools-v1.0/bin:/tools/cluster-software/cluster-tools/cluster-tools-v1.0/bin:/home/S/wuyt/.elan/bin:/home/S/wuyt/.cargo/bin:/home/S/wuyt/nfs_global/anaconda3/envs/deepscaler/bin:/home/S/wuyt/lustre/anaconda3/condabin:/home/S/wuyt/.local/bin:/home/S/wuyt/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/nfs_global/S/wuyt/.local/bin:/nfs_global/S/wuyt/wuyt/git-lfs/git-lfs-3.2.0)
wandb: Appending key for api.wandb.ai to your netrc file: /home/S/wuyt/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
[2025-04-01 10:04:11,489 W 634026 634026] global_state_accessor.cc:429: Retrying to get node with node ID 2ab9de94a7185bdf0132a2fa88197f4972316281e3d52a2ad135506d
2025-04-01 10:04:20,398 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_443717d5f8aeb41f.zip.
2025-04-01 10:04:20,398 INFO packaging.py:575 -- Creating a file package for local module '.'.
train-multigpu.sh: line 220: 728523 Terminated copy_log_and_plot
chmod: changing permissions of '../tmp': Operation not permitted
cp: cannot create special file '../tmp/ray_wuyt/session_latest/sockets/plasma_store': File exists
cp: cannot create special file '../tmp/ray_wuyt/session_latest/sockets/raylet': File exists
/var/log/atop/atop_20250401 - stat raw file: No such file or directory
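The chmod and cp failures above come from the log-copying step (`copy_log_and_plot`) trying to mirror Ray's session directory into `../tmp`: `plasma_store` and `raylet` under `session_latest/sockets` are UNIX sockets, which `cp` cannot recreate over existing special files. A minimal sketch of a copy that skips sockets instead (the function name and the Python approach are assumptions; the script itself uses `cp`):

```python
import os
import shutil
import stat

def copy_session_logs(src: str, dst: str) -> None:
    """Copy a Ray session tree, skipping the UNIX sockets that cp chokes on."""
    def ignore_sockets(dirpath, names):
        # Return the entries that are sockets (e.g. plasma_store, raylet)
        # so shutil.copytree leaves them out of the copy.
        return [
            name for name in names
            if stat.S_ISSOCK(os.lstat(os.path.join(dirpath, name)).st_mode)
        ]

    shutil.copytree(src, dst, ignore=ignore_sockets, dirs_exist_ok=True)
```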
WARNING: Did not unuse /usr/share/Modules/modulefiles
No Modulefiles Currently Loaded.
Currently Loaded Modulefiles:
1) cluster-tools/v1.0 3) gcc/9.3.0
2) slurm-tools/v1.0 4) cuda-cudnn/11.8-8.8.1
wandb: Appending key for api.wandb.ai to your netrc file: /home/S/wuyt/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
[2025-04-02 01:06:35,166 W 3115729 3115729] global_state_accessor.cc:429: Retrying to get node with node ID 736407c8bd15a81bfe135e9345be09f72e478a7f772f20455efb29e7
2025-04-02 01:06:45,530 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_55d2a6863b4a199c.zip.
2025-04-02 01:06:45,531 INFO packaging.py:575 -- Creating a file package for local module '.'.
train-multigpu.sh: line 223: 595967 Terminated copy_log_and_plot
Traceback (most recent call last):
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 298, in <module>
    plot_data(args.folder, no_ratio=args.no_ratio)
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 282, in plot_data
    plot_different_accuracy_ratio(folder)
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 127, in plot_different_accuracy_ratio
    df = pd.read_csv(csv_file_path)
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
    return mapping[engine](f, **self.options)
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "parsers.pyx", line 581, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
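`EmptyDataError` is what pandas raises when `read_csv` is handed a file with no parsable columns, typically a zero-byte CSV; here that can happen when the plotting script fires before the trainer has written any metrics. A sketch of a guard (the `csv_file_path` name is taken from the traceback; the skip-and-return behavior is an assumed fix, not the repository's actual code):

```python
import os
import pandas as pd

def read_metrics_csv(csv_file_path: str) -> pd.DataFrame | None:
    """Read a metrics CSV, returning None instead of crashing on empty files."""
    if not os.path.exists(csv_file_path) or os.path.getsize(csv_file_path) == 0:
        print(f"warning: {csv_file_path} missing or empty, skipping plot")
        return None
    try:
        return pd.read_csv(csv_file_path)
    except pd.errors.EmptyDataError:
        # Non-empty file that still has no columns (e.g. a lone newline).
        print(f"warning: no columns to parse in {csv_file_path}, skipping plot")
        return None
```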
chmod: changing permissions of '../tmp': Operation not permitted
cp: cannot create special file '../tmp/ray_wuyt/session_latest/sockets/plasma_store': File exists
cp: cannot create special file '../tmp/ray_wuyt/session_latest/sockets/raylet': File exists
/var/log/atop/atop_20250402 - stat raw file: No such file or directory
WARNING: Did not unuse /usr/share/Modules/modulefiles
No Modulefiles Currently Loaded.
Currently Loaded Modulefiles:
1) cluster-tools/v1.0 3) gcc/9.3.0
2) slurm-tools/v1.0 4) cuda-cudnn/11.8-8.8.1
wandb: Appending key for api.wandb.ai to your netrc file: /home/S/wuyt/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
2025-04-02 01:14:14,510 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_55d2a6863b4a199c.zip.
2025-04-02 01:14:14,510 INFO packaging.py:575 -- Creating a file package for local module '.'.
Traceback (most recent call last):
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'reward/correct_0%_ratio'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 298, in <module>
    plot_data(args.folder, no_ratio=args.no_ratio)
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 282, in plot_data
    plot_different_accuracy_ratio(folder)
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 131, in plot_different_accuracy_ratio
    df[f'{col}_smoothed'] = smooth_data(df[col])
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'reward/correct_0%_ratio'
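This `KeyError` is `plot_different_accuracy_ratio` indexing a `reward/correct_0%_ratio` column that this particular run never logged. A hedged sketch of smoothing only the columns that exist (the column list and window size are illustrative; plot.py's real ones are not visible in the log):

```python
import pandas as pd

def smooth_present_columns(df: pd.DataFrame, columns: list[str],
                           window_size: int = 10) -> pd.DataFrame:
    """Add '<col>_smoothed' rolling means for only the columns present in df."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        # Runs started under older configs may simply lack some ratio metrics.
        print(f"warning: columns absent from CSV, skipped: {missing}")
    for col in columns:
        if col in df.columns:
            df[f"{col}_smoothed"] = (
                df[col].rolling(window=window_size, min_periods=1).mean()
            )
    return df
```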
train-multigpu.sh: line 223: 618003 Terminated copy_log_and_plot
Traceback (most recent call last):
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 298, in <module>
    plot_data(args.folder, no_ratio=args.no_ratio)
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 282, in plot_data
    plot_different_accuracy_ratio(folder)
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 131, in plot_different_accuracy_ratio
    df[f'{col}_smoothed'] = smooth_data(df[col])
  File "/nfs_global/S/zhuyaoyu/projects/verl/plot_and_analyze/plot.py", line 44, in smooth_data
    return data.rolling(window=window_size, min_periods=1).mean()
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/core/generic.py", line 12580, in rolling
    return Rolling(
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/core/window/rolling.py", line 170, in __init__
    self._validate()
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/core/window/rolling.py", line 1869, in _validate
    super()._validate()
  File "/workspace/S/zhuyaoyu/softwares/miniconda3/envs/verl/lib/python3.10/site-packages/pandas/core/window/rolling.py", line 181, in _validate
    raise ValueError(
ValueError: min_periods 1 must be <= window 0
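`min_periods 1 must be <= window 0` means `smooth_data` computed a rolling window of 0, which pandas rejects. That typically happens when the window is derived from the series length and the run is still very short. A sketch that clamps the window (the length-based heuristic is an assumption; plot.py's actual window rule is not visible in the log):

```python
import pandas as pd

def smooth_data(data: pd.Series, window_size: int | None = None) -> pd.Series:
    """Rolling mean whose window is clamped to at least 1."""
    if window_size is None:
        # Hypothetical heuristic: ~5% of the series length. On a short run
        # this integer division can hit 0, producing exactly the ValueError
        # above, so clamp it.
        window_size = max(1, len(data) // 20)
    return data.rolling(window=window_size, min_periods=1).mean()
```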
chmod: changing permissions of '../tmp': Operation not permitted
cp: cannot create special file '../tmp/ray_wuyt/session_latest/sockets/plasma_store': File exists
cp: cannot create special file '../tmp/ray_wuyt/session_latest/sockets/raylet': File exists
/var/log/atop/atop_20250402 - stat raw file: No such file or directory
WARNING: Did not unuse /usr/share/Modules/modulefiles
No Modulefiles Currently Loaded.
Currently Loaded Modulefiles:
1) cluster-tools/v1.0 3) gcc/9.3.0
2) slurm-tools/v1.0 4) cuda-cudnn/11.8-8.8.1
wandb: Appending key for api.wandb.ai to your netrc file: /home/S/wuyt/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
[2025-04-02 11:51:05,525 W 3127793 3127793] global_state_accessor.cc:429: Retrying to get node with node ID b26f7a02826b437d1ff5aec846a4896c59f24a33b9b656ec3cd16f56
2025-04-02 11:51:16,404 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_75d347a5920743dc.zip.
2025-04-02 11:51:16,404 INFO packaging.py:575 -- Creating a file package for local module '.'.
train-multigpu.sh: line 223: 701843 Terminated copy_log_and_plot
chmod: changing permissions of '../tmp': Operation not permitted
cp: cannot create special file '../tmp/ray_wuyt/session_latest/sockets/plasma_store': File exists
cp: cannot create special file '../tmp/ray_wuyt/session_latest/sockets/raylet': File exists
/var/log/atop/atop_20250402 - stat raw file: No such file or directory
Currently Loaded Modulefiles:
1) git/2.31.1 2) gcc/9.3.0 3) cmake/3.21.7
Currently Loaded Modulefiles:
1) git/2.31.1 3) cmake/3.21.7 5) slurm-tools/v1.0
2) gcc/9.3.0 4) cluster-tools/v1.0 6) cuda-cudnn/12.1-8.9.3
Job start at 2025-04-04 08:51:30
Job run at:
Static hostname: localhost.localdomain
Transient hostname: r8l40-a00.ib.future.cn
Icon name: computer-server
Chassis: server
Machine ID: 5a5f22d1ca484ec4bb0c3310c788be8b
Boot ID: 870c9831f3b64f2ca8b3258b37fb8613
Operating System: Rocky Linux 8.7 (Green Obsidian)
CPE OS Name: cpe:/o:rocky:rocky:8:GA
Kernel: Linux 4.18.0-425.10.1.el8_7.x86_64
Architecture: x86-64
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rl-root 376G 18G 358G 5% /
/dev/nvme4n1p1 3.5T 25G 3.5T 1% /local
/dev/nvme2n1p1 3.5T 29G 3.5T 1% /tmp
/dev/mapper/rl-var 512G 9.9G 502G 2% /var
/dev/nvme0n1p2 2.0G 366M 1.7G 18% /boot
/dev/nvme1n1p1 3.5T 43G 3.5T 2% /local/nfscache
/dev/nvme0n1p1 599M 5.8M 594M 1% /boot/efi
ssd.nas00.future.cn:/rocky8_home 16G 3.3G 13G 21% /home
ssd.nas00.future.cn:/rocky8_workspace 400G 239G 162G 60% /workspace
ssd.nas00.future.cn:/rocky8_tools 5.0T 75G 5.0T 2% /tools
ssd.nas00.future.cn:/centos7_home 16G 7.6G 8.5G 47% /centos7/home
ssd.nas00.future.cn:/centos7_workspace 400G 5.2G 395G 2% /centos7/workspace
ssd.nas00.future.cn:/centos7_tools 5.0T 235G 4.8T 5% /centos7/tools
ssd.nas00.future.cn:/eda-tools 8.0T 5.7T 2.4T 72% /centos7/eda-tools
hdd.nas00.future.cn:/share_personal 500G 414M 500G 1% /share/personal
zone05.nas01.future.cn:/NAS_HPC_collab_codemodel 34T 33T 858G 98% /share/collab/codemodel
ext-zone00.nas02.future.cn:/nfs_global 289T 276T 14T 96% /nfs_global
ssd.nas00.future.cn:/common_datasets 75T 63T 13T 84% /datasets
192.168.12.10@o2ib:192.168.12.11@o2ib:/lustre 1.9P 54T 1.7P 4% /lustre
beegfs_nodev 70T 15T 56T 21% /fast
Have already added /tools/cluster-modulefiles into $MODULEPATH
/tools/cluster-software/gcc/gcc-9.3.0/bin/gcc
/workspace/S/zhuyaoyu/softwares/miniconda3/bin/python
/workspace/S/zhuyaoyu/softwares/miniconda3/bin/python3
############### /home : /home/S/zhuyaoyu
Disk quotas for user zhuyaoyu (uid 6207):
Filesystem space quota limit grace files quota limit grace
/home 3353M 16384M 20480M 90671 0 0
############### /workspace
Disk quotas for user zhuyaoyu (uid 6207):
Filesystem space quota limit grace files quota limit grace
/workspace 239G 400G 500G 799k 0 0
############### /nfs_global
Disk quotas for user zhuyaoyu (uid 6207):
Filesystem space quota limit grace files quota limit grace
/nfs_global 2410G 5120G 7168G 2069k 5000k 10000k
############### /lustre
Disk quotas for user zhuyaoyu (uid 6207):
Filesystem used quota limit grace files quota limit grace
/lustre 0k 8T 10T - 0 3000000 36000000 -
uid 6207 is using default block quota setting
uid 6207 is using default file quota setting
name, driver_version, power.limit [W]
NVIDIA L40, 550.54.15, 275.00 W
NVIDIA L40, 550.54.15, 275.00 W
NVIDIA L40, 550.54.15, 275.00 W
NVIDIA L40, 550.54.15, 275.00 W
NVIDIA L40, 550.54.15, 275.00 W
NVIDIA L40, 550.54.15, 275.00 W
NVIDIA L40, 550.54.15, 275.00 W
NVIDIA L40, 550.54.15, 275.00 W
Using GPU(s) 0,1,2,3,4,5,6,7
This job is assigned the following resources by SLURM:
CPU_IDs=0-31,56-87 GRES=gpu:8(IDX:0-7)
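The GPU table above is CSV output from an `nvidia-smi` query; the header row `name, driver_version, power.limit [W]` matches the fields requested. A sketch of reproducing it (the subprocess wrapper is illustrative; the job script presumably calls `nvidia-smi` directly):

```python
import subprocess

# Reproduces the "name, driver_version, power.limit [W]" table above.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version,power.limit",
     "--format=csv"],
    check=True, capture_output=True, text=True,
)
print(result.stdout)
```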
Have already added /tools/cluster-modulefiles into $MODULEPATH
Got device mesh tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
dtype=torch.int32), mesh_dim_names ('fsdp',)
Processing model shards with 16 (16,) in total
Writing to local disk
Saving model to ckpt/codev_distill_16k_vllm1_v2/global_step_120/actor/huggingface
Got device mesh tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
dtype=torch.int32), mesh_dim_names ('fsdp',)
Processing model shards with 16 (16,) in total
Writing to local disk
Saving model to ckpt/codev_distill_16k_vllm1_v2/global_step_140/actor/huggingface
Job end at 2025-04-04 08:53:38
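The two blocks above show the 16 FSDP shards being consolidated and written out as HuggingFace checkpoints for steps 120 and 140. Once merged, such a directory should load with the standard `transformers` API (a sketch; `AutoModelForCausalLM` and the presence of tokenizer files in the export are assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "ckpt/codev_distill_16k_vllm1_v2/global_step_140/actor/huggingface"
# The merged shards form a standard HF checkpoint directory, so the usual
# from_pretrained loaders apply.
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
```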