I am new to pytorch-distributed, so any input helps. I have code that works on a single GPU and I am trying to make it distributed, but I am getting an error that I assume is a socket connection problem. Below is the code (I have left out the parts of the code that are probably not relevant to the problem).
$> torchrun --nproc_per_node=4 --nnodes=1 train_dist.py
CODE:
import datetime
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
import time
import sys
import numpy as np
import torch
from torch.utils.data import DataLoader, DistributedSampler
from torch.utils.data.dataloader import default_collate
from torch import nn
import torch.nn.functional as F
import torchvision
from torchvision import transforms
import torch.distributed as dist
import utils
from scheduler import WarmupMultiStepLR
from datasets.ntu60_hoi import NTU60Subject
import models.AR_pcd_flow as Models

# Function to initialize the distributed environment
def init_distributed():
    # Example using torch.distributed.launch:
    # ... (body omitted)

# ... (omitted)

# convert scheduler to be per iteration, not per epoch, for warmup that lasts
# between different epochs
warmup_iters = args.lr_warmup_epochs * len(data_loader)
lr_milestones = [len(data_loader) * m for m in args.lr_milestones]
lr_scheduler = WarmupMultiStepLR(optimizer, milestones=lr_milestones,
                                 gamma=args.lr_gamma,
                                 warmup_iters=warmup_iters,
                                 warmup_factor=1e-5)

# ... (omitted)

if cur_acc > acc:  # > 0.7 and cur_acc > acc:
    acc = cur_acc
    path = os.path.join(args.output_dir, f"model_{epoch}_ntu60_DTr.pth")
    torch.save(model.state_dict(), path)
    print("model saved")
    with open('NTU60_epoch.txt', 'a') as f:
        f.write(str(epoch) + '\n')
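Since the traceback goes through init_distributed() and the torch.cuda.set_device(local_rank) call on line 33, here is roughly what that function does. This is only a minimal sketch of the standard torchrun env:// pattern; the exact body in train_dist.py may differ slightly:

import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun exports these variables for every worker it spawns
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # this is the call that fails (line 33 in the traceback)
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    device = torch.device("cuda", local_rank)
    return device, rank, world_size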
Below is the ERROR:
[2025-01-15 22:44:52,198] torch.distributed.run: [WARNING]
[2025-01-15 22:44:52,198] torch.distributed.run: [WARNING] *****************************************
[2025-01-15 22:44:52,198] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2025-01-15 22:44:52,198] torch.distributed.run: [WARNING] *****************************************
Traceback (most recent call last):
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 315, in _lazy_init
queued_call()
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 183, in _check_capability
capability = get_device_capability(d)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
prop = get_device_properties(device)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 457, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 284, in
main(args)
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 177, in main
device, rank, world_size = init_distributed()
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 33, in init_distributed
torch.cuda.set_device(local_rank)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 408, in set_device
torch._C._cuda_setDevice(device)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
CUDA call was originally invoked at:
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 9, in
import torch
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/__init__.py", line 1427, in
_C._initExtension(manager_path())
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 247, in
_lazy_call(_check_capability)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))
Traceback (most recent call last):
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 315, in _lazy_init
queued_call()
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 183, in _check_capability
capability = get_device_capability(d)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 439, in get_device_capability
prop = get_device_properties(device)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 457, in get_device_properties
return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 284, in
main(args)
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 177, in main
device, rank, world_size = init_distributed()
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 33, in init_distributed
torch.cuda.set_device(local_rank)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 408, in set_device
torch._C._cuda_setDevice(device)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 321, in _lazy_init
raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=1, num_gpus=
CUDA call was originally invoked at:
File "/scratch/nam123/HOI4D_ctr/train_dist.py", line 9, in
import torch
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/__init__.py", line 1427, in
_C._initExtension(manager_path())
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 247, in
_lazy_call(_check_capability)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/cuda/__init__.py", line 244, in _lazy_call
_queued_calls.append((callable, traceback.format_stack()))
[2025-01-15 22:44:57,235] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2018871) of binary: /home/nam123/.conda/envs/py39/bin/python
Traceback (most recent call last):
File "/home/nam123/.conda/envs/py39/bin/torchrun", line 8, in
sys.exit(main())
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/nam123/.conda/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-15_22:44:57
host : sg049.sol.rc.asu.edu
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2018872)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-15_22:44:57
host : sg049.sol.rc.asu.edu
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2018871)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I have checked LINK and none of the solutions helped.
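One thing I still want to rule out is that each spawned worker only sees a single GPU, since the assert reports device=1 and the num_gpus value is cut off in the output. This is a small diagnostic sketch (the variable names and the print are mine) that could go right before the set_device call:

import os
import torch

# print what each worker process actually sees before any CUDA call
local_rank = os.environ.get("LOCAL_RANK")
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
print(f"LOCAL_RANK={local_rank}  CUDA_VISIBLE_DEVICES={visible}  "
      f"device_count={torch.cuda.device_count()}")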