I'm new to AI development and I'm trying to train a model. The thing is, the server has two AMD GPUs, Radeon RX 7600 XT, and the CPU is a Ryzen 9 5900XT 16-core. I had already run into several problems when I wrote the training code from scratch and tried to use the GPUs, so I changed my approach and followed the official AMD documentation for AI training. This is what I get:
Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Writing logs to /workspace/notebooks/result/logs/log_1744287800.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 955, in
[rank1]: sys.exit(recipe_main())
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank1]: sys.exit(recipe_main(conf))
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 949, in recipe_main
[rank1]: recipe.setup(cfg=cfg)
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 296, in setup
[rank1]: self._model = self._setup_model(
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 607, in _setup_model
[rank1]: m.rope_init()
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/models/llama3_1/_position_embeddings.py", line 69, in rope_init
[rank1]: ** (torch.arange(0, self.dim, 2)[: (self.dim // 2)].float() / self.dim)
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank1]: return func(*args, **kwargs)
[rank1]: RuntimeError: HIP error: invalid device function
[rank1]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank1]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 955, in
[rank0]: sys.exit(recipe_main())
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank0]: sys.exit(recipe_main(conf))
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 949, in recipe_main
[rank0]: recipe.setup(cfg=cfg)
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 296, in setup
[rank0]: self._model = self._setup_model(
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 607, in _setup_model
[rank0]: m.rope_init()
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/models/llama3_1/_position_embeddings.py", line 69, in rope_init
[rank0]: ** (torch.arange(0, self.dim, 2)[: (self.dim // 2)].float() / self.dim)
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: RuntimeError: HIP error: invalid device function
[rank0]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank0]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[rank0]:[W410 12:23:26.761619117 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0410 12:23:27.490000 7605 site-packages/torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 7738) of binary: /opt/conda/envs/py_3.10/bin/python3
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/bin/tune", line 8, in
sys.exit(main())
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
parser.run(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
args.func(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
run(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py FAILED
The whole thing runs inside a Docker container, as described in the documentation. I tried writing an environment variable to .bashrc to set it permanently, because I thought the problem was that my gfx architecture is not supported:
echo "export PYTORCH_ROCM_ARCH=gfx1102" >> ~/.bashrc source ~/.bashrc< /code>
Then I cloned the PyTorch Git repository locally and ran:
cd /workspace/pytorch
pip install -r requirements.txt
python setup.py install
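For completeness, this is roughly the full sequence as I understand the build-from-source instructions, with the clone step included. The --recursive flag and the hipify step (tools/amd_build/build_amd.py) come from PyTorch's own build README rather than the AMD guide, so treat those details as my assumptions:

# clone PyTorch together with its submodules (assumed; the guide just says to clone the project)
git clone --recursive https://github.com/pytorch/pytorch /workspace/pytorch
cd /workspace/pytorch
pip install -r requirements.txt
# translate the CUDA sources to HIP before building, per PyTorch's ROCm build instructions
python tools/amd_build/build_amd.py
# build only for my architecture
PYTORCH_ROCM_ARCH=gfx1102 python setup.py install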
Finally, I re-ran the training with

tune run --nproc_per_node 2 full_finetune_distributed --config /workspace/notebooks/my_custom_config_distributed.yaml

but it gave me the same error. The guide lists prerequisites for which GPUs are supported, and mine is not in that table, but I found someone online who tried to follow the guide even though their GPU was not in the table either, and they said they somehow pulled it off...
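My current suspicion is that the HIP "invalid device function" error just means the PyTorch build that actually gets imported contains no kernels for gfx1102, i.e. my rebuilt PyTorch isn't the one being loaded. The check I plan to run next (torch.cuda.get_arch_list() lists the GPU architectures the binary was compiled for; on ROCm builds these are gfx targets):

# which gfx targets the imported PyTorch binary actually ships kernels for
python3 -c "import torch; print(torch.cuda.get_arch_list())"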
I know there must be a way to do this, but I can't figure out how. The server runs Linux (Ubuntu); more info:
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)".