Strange PyTorch error in distributed training on multiple GPUs - RuntimeError: Expected all te

Anonymous · Post by **Anonymous** » 29 Dec 2025, 04:11

I am trying to run distributed training with PyTorch's DistributedDataParallel (DDP) on multiple GPUs, but I hit a RuntimeError indicating that tensors are on different devices. It happens during the forward pass, and training works fine on a single GPU.
Environment:

PyTorch version: 2.0.1
CUDA version: 11.7
Operating system: Ubuntu 22.04
Hardware: 4x NVIDIA RTX 3090 GPUs
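
For anyone reproducing this, the environment details above can be collected programmatically. The following is only a minimal sketch: `environment_report` is a hypothetical helper name, and it degrades gracefully when PyTorch is not installed. PyTorch also ships `python -m torch.utils.collect_env`, which prints a more complete report.

```python
import importlib.util
import platform


def environment_report():
    """Collect interpreter, OS, and (if available) PyTorch/CUDA details."""
    report = {"python": platform.python_version(), "os": platform.platform()}
    if importlib.util.find_spec("torch") is None:
        report["torch"] = None  # PyTorch is not installed in this interpreter
        return report

    import torch

    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```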

Minimal reproducible example: Here is a minimal script that reproduces the problem. It defines a simple model and a dummy data loader.

Code:

import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        return self.fc(x)

def setup(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def train(rank, world_size):
    setup(rank, world_size)

    model = SimpleModel().to(rank)
    model = DDP(model, device_ids=[rank])

    optimizer = torch.optim.Adam(model.parameters())

    # Dummy data loader (on CPU initially)
    inputs = torch.randn(4, 10)  # Batch size 4, input size 10
    labels = torch.randint(0, 10, (4,))
    data_loader = [(inputs, labels)]  # Single batch for minimal example

    for epoch in range(1):  # Minimal loop
        for batch in data_loader:
            inputs, labels = batch
            inputs = inputs.to(rank)
            labels = labels.to(rank)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = nn.CrossEntropyLoss()(outputs, labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

Launch command: I start the script with the following command:

Code:

torchrun --nproc_per_node=4 script.py
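
One detail worth checking with this launch command: `torchrun` already starts `--nproc_per_node` worker processes and exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each of them, while the script calls `mp.spawn` on its own, so every `torchrun` worker would spawn a further set of processes. Whether this is related to the error is not established here. The sketch below only shows how a script could detect which launcher it is running under; `read_torchrun_env` and `launched_by_torchrun` are hypothetical helper names, not PyTorch APIs.

```python
import os


def read_torchrun_env(environ=None):
    """Return (rank, local_rank, world_size) from torchrun-style variables.

    Falls back to a single-process configuration when the variables are
    absent, e.g. when the script is started with plain `python script.py`.
    """
    env = os.environ if environ is None else environ
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


def launched_by_torchrun(environ=None):
    """True if all rank variables that torchrun exports are present."""
    env = os.environ if environ is None else environ
    return all(key in env for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE"))


if __name__ == "__main__":
    print("torchrun launch:", launched_by_torchrun())
    print("rank, local_rank, world_size:", read_torchrun_env())
```

Under `torchrun`, a script would typically call `train(local_rank, world_size)` directly instead of going through `mp.spawn`. With `mp.spawn`, the script would instead be started with plain `python script.py`, and `MASTER_ADDR` and `MASTER_PORT` would need to be set before `dist.init_process_group` is called with the default `env://` initialization.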

Full error message: The error occurs during the model's forward pass (outputs = model(inputs)). Here is the full traceback:

Code:

Traceback (most recent call last):
  File "/path/to/script.py", line 45, in <module>
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/path/to/script.py", line 35, in train
    outputs = model(inputs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/script.py", line 14, in forward
    return self.fc(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA__addmm)

I have made sure that torch.cuda.set_device(rank) is called and that the data is explicitly moved to the device. The problem occurs only in the multi-GPU setup. What could be causing this, and how can I fix it? Could it be related to DDP synchronization or to how the data loader is handled?
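
To narrow down which tensor is still on the CPU, it may help to list the device of every parameter and input right before the forward pass. This is a minimal diagnostic sketch, not a fix: `group_by_device` and `assert_single_device` are hypothetical helpers that only assume each object exposes a `.device` attribute, as PyTorch tensors and parameters do. The demo uses stand-in objects so it runs without a GPU.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_device(named_items):
    """Group names by the string form of each item's `.device`."""
    groups = defaultdict(list)
    for name, item in named_items:
        groups[str(item.device)].append(name)
    return dict(groups)


def assert_single_device(named_items):
    """Raise if the items span more than one device; return the device."""
    groups = group_by_device(named_items)
    if len(groups) > 1:
        raise RuntimeError(f"Tensors are spread across devices: {groups}")
    return next(iter(groups), None)


if __name__ == "__main__":
    # Stand-ins for real tensors (assumption: only `.device` is needed).
    fake_items = [
        ("fc.weight", SimpleNamespace(device="cuda:0")),
        ("inputs", SimpleNamespace(device="cpu")),
    ]
    print(group_by_device(fake_items))
```

In the training loop, the check could be placed directly before `outputs = model(inputs)`, for example `assert_single_device([*model.named_parameters(), ("inputs", inputs), ("labels", labels)])`, so that each rank reports which names sit on which device.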
