I'm new to AI development and I'm trying to train a model. The thing is, the server has two AMD GPUs, Radeon RX 7600 XT, and the CPU is a Ryzen 9 5900XT 16-core. I had already run into several problems when I wrote the training code from scratch and tried to use the GPUs, so I changed my approach and followed the official AMD documentation for AI training. This is what I get:
Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Writing logs to /workspace/notebooks/result/logs/log_1744287800.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 955, in
[rank1]: sys.exit(recipe_main())
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank1]: sys.exit(recipe_main(conf))
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 949, in recipe_main
[rank1]: recipe.setup(cfg=cfg)
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 296, in setup
[rank1]: self._model = self._setup_model(
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 607, in _setup_model
[rank1]: m.rope_init()
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/models/llama3_1/_position_embeddings.py", line 69, in rope_init
[rank1]: ** (torch.arange(0, self.dim, 2)[: (self.dim // 2)].float() / self.dim)
[rank1]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank1]: return func(*args, **kwargs)
[rank1]: RuntimeError: HIP error: invalid device function
[rank1]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank1]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 955, in
[rank0]: sys.exit(recipe_main())
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank0]: sys.exit(recipe_main(conf))
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 949, in recipe_main
[rank0]: recipe.setup(cfg=cfg)
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 296, in setup
[rank0]: self._model = self._setup_model(
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 607, in _setup_model
[rank0]: m.rope_init()
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/models/llama3_1/_position_embeddings.py", line 69, in rope_init
[rank0]: ** (torch.arange(0, self.dim, 2)[: (self.dim // 2)].float() / self.dim)
[rank0]: File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: RuntimeError: HIP error: invalid device function
[rank0]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing AMD_SERIALIZE_KERNEL=3
[rank0]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
[rank0]:[W410 12:23:26.761619117 ProcessGroupNCCL.cpp:1487] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0410 12:23:27.490000 7605 site-packages/torch/distributed/elastic/multiprocessing/api.py:870] failed (exitcode: 1) local_rank: 0 (pid: 7738) of binary: /opt/conda/envs/py_3.10/bin/python3
Traceback (most recent call last):
File "/opt/conda/envs/py_3.10/bin/tune", line 8, in
sys.exit(main())
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
parser.run(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
args.func(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
run(args)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/recipes/full_finetune_distributed.py FAILED
The whole thing runs inside a Docker container, as described in the documentation. I tried writing an environment variable to .bashrc to set it permanently, because I thought the problem was that my gfx architecture is not supported:
echo "export PYTORCH_ROCM_ARCH=gfx1102" >> ~/.bashrc source ~/.bashrc< /code>
Then I cloned the PyTorch Git repository locally and ran:
cd /workspace/pytorch
pip install -r requirements.txt
python setup.py install
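For completeness, this is roughly the full sequence as I understand the build-from-source instructions, with the clone step included. The --recursive flag and the hipify step (tools/amd_build/build_amd.py) come from PyTorch's own build README rather than the AMD guide, so treat those details as my assumptions:

# clone PyTorch together with its submodules (assumed; the guide just says to clone the project)
git clone --recursive https://github.com/pytorch/pytorch /workspace/pytorch
cd /workspace/pytorch
pip install -r requirements.txt
# translate the CUDA sources to HIP before building, per PyTorch's ROCm build instructions
python tools/amd_build/build_amd.py
# build only for my architecture
PYTORCH_ROCM_ARCH=gfx1102 python setup.py install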
Finally, I re-ran the training with

tune run --nproc_per_node 2 full_finetune_distributed --config /workspace/notebooks/my_custom_config_distributed.yaml

but it gave me the same error. The guide lists prerequisites for which GPUs are supported, and mine is not in that table, but I found someone online who tried to follow the guide even though their GPU was not in the table either, and they said they somehow pulled it off...
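My current suspicion is that the HIP "invalid device function" error just means the PyTorch build that actually gets imported contains no kernels for gfx1102, i.e. my rebuilt PyTorch isn't the one being loaded. The check I plan to run next (torch.cuda.get_arch_list() lists the GPU architectures the binary was compiled for; on ROCm builds these are gfx targets):

# which gfx targets the imported PyTorch binary actually ships kernels for
python3 -c "import torch; print(torch.cuda.get_arch_list())"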
I know there must be a way to do this, but I can't figure out how. The server runs Linux (Ubuntu); more info:
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)".