What is going on with this AI model fine-tuning?

Post by Anonymous »
I'm following the AMD ROCm training documentation; I have already configured Docker and installed all the necessary dependencies. The docs also cover training AI models on your own files using what they call "recipes", and they say these training recipes are ready to use out of the box. This is my custom recipe:
output_dir: /workspace/notebooks/result/ # /tmp may be deleted by your system. Change it to your preference.

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /workspace/notebooks/modello-preaddestrato/original/tokenizer.model
  max_seq_len: null

# Dataset
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: /workspace/notebooks/datasets/dataset.json
  packed: False # True increases speed
  conversation_column: conversations
  conversation_style: chatml
seed: 42
shuffle: True

# Model Arguments
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /workspace/notebooks/modello-preaddestrato/
  checkpoint_files: [
    model-00001-of-00004.safetensors,
    model-00002-of-00004.safetensors,
    model-00003-of-00004.safetensors,
    model-00004-of-00004.safetensors
  ]
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: LLAMA3
resume_from_checkpoint: False

# Fine-tuning arguments
batch_size: 2
epochs: 2

optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  fused: True
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null
clip_grad_norm: null
compile: False # torch.compile the model + loss, True increases speed + decreases memory
optimizer_in_bwd: False # True saves memory. Requires gradient_accumulation_steps=1
gradient_accumulation_steps: 4 # Use to increase effective batch size

# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True # True reduces memory
enable_activation_offloading: False # True reduces memory
custom_sharded_layers: ['tok_embeddings', 'output'] # Layers to shard separately (useful for large vocab size models). Lower Memory, but lower speed.

# Reduced precision
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True

# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1
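In case the dataset format is relevant: my dataset.json is a JSON list where every record has a conversations field. Below is a toy record in the same shape (the real contents differ, and the role/content message layout is only my assumption of what conversation_style: chatml expects):

# toy_dataset.py - illustrative only; my real dataset.json has different contents
# (the role/content layout is my assumption of the "chatml" conversation style)
import json

toy_dataset = [
    {
        "conversations": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What does ROCm stand for?"},
            {"role": "assistant", "content": "Radeon Open Compute platform."},
        ]
    }
]

# written to a separate example file so the real dataset is not touched
with open("dataset_example.json", "w") as f:
    json.dump(toy_dataset, f, indent=2)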
I think I get stuck as soon as I start the training, because I run this command:
tune run --nproc_per_node 2 full_finetune_distributed --config /workspace/notebooks/my_custom_config_distributed.yaml
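For what it's worth, this is the kind of quick sanity check I can run inside the container before launching, just to confirm that PyTorch actually sees both cards (a minimal sketch, not taken from the torchtune docs):

# check_gpus.py - quick sanity check that the ROCm build of PyTorch sees both GPUs
import torch

print(torch.__version__, torch.version.hip)   # torch.version.hip is set on ROCm builds
print(torch.cuda.is_available())              # ROCm devices are exposed through the cuda API
print(torch.cuda.device_count())              # should be 2 for the two RX 7600 XT cards
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))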
When I launch it, the terminal prints the resolved config from the YAML file along with other info and debug output, as follows:
Running with torchrun...
W0411 10:51:13.143000 9895 site-packages/torch/distributed/run.py:766]
W0411 10:51:13.143000 9895 site-packages/torch/distributed/run.py:766] *****************************************
W0411 10:51:13.143000 9895 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0411 10:51:13.143000 9895 site-packages/torch/distributed/run.py:766] *****************************************

INFO:torchtune.utils._logging:Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /workspace/notebooks/modello-preaddestrato/
  checkpoint_files:
  - model-00001-of-00004.safetensors
  - model-00002-of-00004.safetensors
  - model-00003-of-00004.safetensors
  - model-00004-of-00004.safetensors
  model_type: LLAMA3
  output_dir: /workspace/notebooks/result/
  recipe_checkpoint: null
clip_grad_norm: null
compile: false
custom_sharded_layers:
- tok_embeddings
- output
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: conversations
  conversation_style: chatml
  packed: false
  source: /workspace/notebooks/datasets/dataset.json
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 2
gradient_accumulation_steps: 4
log_every_n_steps: 1
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /workspace/notebooks/result//logs
model:
  _component_: torchtune.models.llama3_1.llama3_1_8b
optimizer:
  _component_: torch.optim.AdamW
  fused: true
  lr: 2.0e-05
optimizer_in_bwd: false
output_dir: /workspace/notebooks/result/
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /workspace/notebooks/result//profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: 42
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /workspace/notebooks/modello-preaddestrato/original/tokenizer.model

[... the same "Running FullFinetuneRecipeDistributed with resolved config" block is printed a second time ...]
INFO:torchtune.utils._logging:Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. Enabling activation offloading should reduce memory further.
DEBUG:torchtune.utils._logging:Setting manual seed to local seed 42. Local seed is seed + rank = 42 + 0
Writing logs to /workspace/notebooks/result/logs/log_1744368681.txt
INFO:torchtune.utils._logging:Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
The problem is that the terminal has now been sitting there for 4 hours. The last log line is:
INFO:torchtune.utils._logging:Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
and nothing else gets logged; there is literally nothing below that last line, not even the shell prompt, which makes me think it is still running, but it probably isn't. I don't know whether I'm missing some dependency that still needs to be installed, or whether something has to be configured for ROCm or via some environment variable inside Docker. These are the components I have on the server:

CPU: AMD Ryzen 9 5900XT 16-core
GPU: AMD Radeon RX 7600 XT 16 GB VRAM (I have two of them)
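If it helps to narrow things down, this is a minimal torch.distributed smoke test I could run with torchrun to check whether the hang is in the GPU-to-GPU communication rather than in torchtune itself (the file name and the approach are my own idea, not from the AMD docs):

# smoke_test_dist.py - run with: torchrun --nproc_per_node 2 smoke_test_dist.py
# If this also hangs, the problem is in the distributed backend, not in the recipe.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # on ROCm builds of PyTorch this maps to RCCL
    rank = int(os.environ["LOCAL_RANK"])     # set by torchrun for each process
    torch.cuda.set_device(rank)
    print(f"rank {rank}: {torch.cuda.get_device_name(rank)}")
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)  # should print 2.0 on both ranks if cross-GPU communication works
    print(f"rank {rank}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()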
