I am new to pytorch-distributed and any input helps. I have code that works on a single GPU, and I am trying to distribute it. I am getting a socket connection error. Below is the code (I have left out the parts of the code that are probably not the problem), followed by the error. I assume it is a socket error.
$> torchrun --nproc_per_node=4 --nnodes=1 train_dist.py
CODE:
[code]
import datetime
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import time
import sys
import numpy as np
import torch
from torch.utils.data import DataLoader, DistributedSampler
from torch.utils.data.dataloader import default_collate
from torch import nn
import torch.nn.functional as F
import torchvision
from torchvision import transforms
import torch.distributed as dist
import utils
from scheduler import WarmupMultiStepLR
from datasets.ntu60_hoi import NTU60Subject
import models.AR_pcd_flow as Models

# Function to initialize the distributed environment
def init_distributed():
    # Example using torch.distributed.launch:
    ...

# convert scheduler to be per iteration, not per epoch, for warmup that lasts
# between different epochs
warmup_iters = args.lr_warmup_epochs * len(data_loader)
lr_milestones = [len(data_loader) * m for m in args.lr_milestones]
lr_scheduler = WarmupMultiStepLR(optimizer, milestones=lr_milestones,
                                 gamma=args.lr_gamma, warmup_iters=warmup_iters,
                                 warmup_factor=1e-5)

if cur_acc > acc:  # > 0.7 and cur_acc > acc:
    acc = cur_acc
    path = os.path.join(args.output_dir, f"model_{epoch}_ntu60_DTr.pth")
    torch.save(model.state_dict(), path)
    print("model saved")
    with open('NTU60_epoch.txt', 'a') as f:
        f.write(str(epoch) + '\n')
[/code]
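The body of init_distributed is not shown above; per the traceback it calls torch.cuda.set_device(local_rank) at line 33 of train_dist.py. A minimal sketch of the usual torchrun setup it is assumed to follow (this sketch is an illustration, not the original function):

[code]
# Sketch only: an assumed init_distributed for torchrun, not the original body.
import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun exports these variables for every worker process it spawns
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Bind this process to its GPU before any CUDA work; this is the call
    # that raises in the traceback below (device index vs. visible GPUs).
    torch.cuda.set_device(local_rank)

    # NCCL process group; rank/world_size are read from the environment
    dist.init_process_group(backend="nccl", init_method="env://")

    device = torch.device(f"cuda:{local_rank}")
    return device, rank, world_size
[/code]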
Below is the ERROR:
[code]
[2025-01-15 22:44:52,198] torch.distributed.run: [WARNING]
[2025-01-15 22:44:52,198] torch.distributed.run: [WARNING] *****************************************
[2025-01-15 22:44:52,198] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-01-15 22:44:52,198] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 315, in _lazy_init
queued_call()
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 183, in _check_capability
capability = get_device_capability(d)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
prop = get_device_properties(device)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 457, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 284, in
main(args)
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 177, in main
device, rank, world_size = init_distributed()
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 33, in init_distributed
torch.cuda.set_device(local_rank)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 408, in set_device
torch._C._cuda_setDevice(device)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
CUDA call was originally invoked at:
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 9, in
import torch
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/__init__.py", line 1427, in
_C._initExtension(manager_path())
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 247, in
_lazy_call(_check_capability)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))
Traceback (most recent call last):
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 315, in _lazy_init
queued_call()
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 183, in _check_capability
capability = get_device_capability(d)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
prop = get_device_properties(device)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 457, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 284, in
main(args)
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 177, in main
device, rank, world_size = init_distributed()
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 33, in init_distributed
torch.cuda.set_device(local_rank)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 408, in set_device
torch._C._cuda_setDevice(device)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
CUDA call was originally invoked at:
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 9, in
import torch
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/__init__.py", line 1427, in
_C._initExtension(manager_path())
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 247, in
_lazy_call(_check_capability)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))
[2025-01-15 22:44:57,235] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2018871) of binary: /home/nam123/.conda/envs/py39/bin/python
Traceback (most recent call last):
File "/home/nam123/.conda/envs/py39/bin/torchrun", line 8, in
sys.exit(main())
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-15_22:44:57
host : sg049.sol.rc.asu.edu
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2018872)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-15_22:44:57
host : sg049.sol.rc.asu.edu
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2018871)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[/code]
I checked LINK and none of the solutions there helped.
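For reference, a quick way to see what each spawned rank actually has access to (a standalone debugging sketch with a hypothetical file name, not part of train_dist.py) is to launch a tiny script with the same torchrun command and print the rank variables next to the visible CUDA device count:

[code]
# check_env.py -- hypothetical debugging script, launched the same way:
#   torchrun --nproc_per_node=4 --nnodes=1 check_env.py
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", -1))
rank = int(os.environ.get("RANK", -1))
world_size = int(os.environ.get("WORLD_SIZE", -1))

# If device_count() comes back smaller than --nproc_per_node, then
# torch.cuda.set_device(local_rank) fails for the higher local ranks,
# which matches the assert shown in the traceback above.
print(f"rank={rank} local_rank={local_rank} world_size={world_size} "
      f"cuda_device_count={torch.cuda.device_count()} "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}")
[/code]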