Kohya-SS SDXL LoRA training resets steps despite successful state load


Post by Anonymous »

I'm running SDXL LoRA training with Kohya's sd-scripts and accelerate. I enabled --save_state and I'm trying to resume training, but the training steps always reset to 0, even though the log says the state files are being loaded.

Script: /workspace/kohya_ss/sd-scripts/sdxl_train_network.py
Optimizer: Prodigy
Resume/output path: /workspace/LoRA/LoRA_Output/Navya/ (confirmed that the latest state directory exists at this path).

Python training setup

Code: Select all

from Tools.check_dependencys import check_dependencies
from Tools.makedir import makedir
import os

print("====================================")
print("🚀 LoRA Training with Realviz XL 5.0")
print("Checking Dependencies...")
print("====================================")

check_dependencies()

print("====================================")
print("🚀 Setting up directories...")
print("====================================")

DATASET_DIR, REG_DIR, LOG_DIR = makedir()
OUTPUT_DIR = "/workspace/LoRA/LoRA_Output/Navya/"

print("====================================")
print("Training Configuration")
print("====================================")

PRETRAINED_MODEL = "/workspace/RealVisXL_V5.0"

# ----------------------
# RESUMPTION PARAMETERS
# ----------------------
# The directory where the training state was saved (usually the output_dir)
# IMPORTANT: This directory must contain the 'last-state' or 'best-state' subfolder.
# You chose '--save_state', so the directory will be in your OUTPUT_DIR.
RESUME_PATH = f"/workspace/LoRA/LoRA_Output/Navya/at-step00000500-state"
STARTING_STEP = 500 # Your current step count

# ----------------------
# HYPERPARAMETERS
# ----------------------

RESOLUTION      = 1024
BATCH_SIZE      = 4
GRAD_ACC_STEPS  = 1
MAX_STEPS       = 600
NETWORK_DIM     = 96
NETWORK_ALPHA   = 96
LEARNING_RATE   = 0.7

# ----------------------
# TRAINING COMMAND
# ----------------------

train_cmd = f'''
accelerate launch --mixed_precision=bf16 /workspace/kohya_ss/sd-scripts/sdxl_train_network.py \\
--pretrained_model_name_or_path="{PRETRAINED_MODEL}" \\
--train_data_dir={DATASET_DIR} \\
--reg_data_dir="{REG_DIR}" \\
--output_dir="{OUTPUT_DIR}" \\
--logging_dir="{LOG_DIR}" \\
--resolution={RESOLUTION} \\
--network_module=networks.lora \\
--network_dim={NETWORK_DIM} \\
--network_alpha={NETWORK_ALPHA} \\
--learning_rate={LEARNING_RATE} \\
--train_batch_size={BATCH_SIZE} \\
--gradient_accumulation_steps={GRAD_ACC_STEPS} \\
--max_train_steps={MAX_STEPS} \\
--save_every_n_steps=150 \\
--text_encoder_lr=0.7 \\
--noise_offset=0.1 \\
--min_snr_gamma=5 \\
--save_last_n_steps=3 \\
--save_last_n_epochs=3 \\
--save_state \\
--save_precision=bf16 \\
--optimizer_type=Prodigy \\
--mem_eff_attn \\
--caption_extension=.txt \\
--max_data_loader_n_workers=4 \\
--log_prefix="LoRA_Logs" \\
--enable_bucket \\
--bucket_reso_steps=64 \\
--log_with tensorboard \\
--resume="{RESUME_PATH}" \\
2>&1 | tee /workspace/train.log
'''

print("====================================")
print("🚀 Starting Training...")
print("====================================")

print("🚀 Resuming LoRA training with Colab Pro A100/L4...\n")
exit_code = os.system(train_cmd)
print("\n✅ Training finished with exit code:", exit_code)
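Before resuming, it can help to check whether the saved state directory actually contains a stored step counter. Newer sd-scripts releases write a `train_state.json` (with fields like `current_step`, as the log note below mentions) next to the accelerate checkpoint files; if that file is missing, the resumed run typically restarts the step counter at 0. This is a minimal sketch, assuming that file layout; the helper name and fallback behaviour are my own:

```python
import json
import os

def read_saved_step(state_dir):
    """Return the step count stored in a kohya state directory.

    Looks for train_state.json (written by newer sd-scripts releases).
    Returns None when the file is missing, which usually means the
    resume will restart the step counter at 0.
    """
    path = os.path.join(state_dir, "train_state.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        state = json.load(f)
    return state.get("current_step")

# Example: point this at the state directory before resuming
# saved = read_saved_step("/workspace/LoRA/LoRA_Output/Navya/at-step00000500-state")
# print("stored step:", saved)
```

If this returns None for your `at-step00000500-state` directory, the installed sd-scripts version may be saving only the accelerate state (weights and optimizer) without the step counter, which would match the behaviour in the logs.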
Log evidence (the conflict)
The log shows that accelerate successfully finds the resume path and loads the state components (which implies the model weights, and possibly the optimizer, are loaded), but the step counter is clearly reset:
2025-10-05 12:50:23 INFO      resume training from local state: /workspace/LoRA/LoRA_Output/Navya/     train_util.py:4684
INFO      Loading states from /workspace/LoRA/LoRA_Output/Navya/                   accelerator.py:3678
INFO      All model weights loaded successfully
INFO      All optimizer states loaded successfully
# NOTE: A successful resume log here would usually show the loaded current_step from train_state.json
Progress bar and epoch counter reset:
epoch 0/700 # Should be current_epoch: X
...
steps: 0%|          | 3/700 [00:20
Complete logs:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
2025-10-05 11:50:55 INFO     Using DreamBooth method.                                                                   train_network.py:517
INFO     prepare images.                                                                              train_util.py:2072
INFO     get image size from name of cache files                                                      train_util.py:1965

0%|          | 0/172 [00:00
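To confirm from `train.log` whether the step counter was actually restored, the first reported step can be pulled out of the tqdm-style `steps:` line with a small parser. A sketch assuming the log lines look like the ones above (the function name is my own):

```python
import re

def resumed_step_from_log(log_text):
    """Return (current_step, max_steps) from the first tqdm-style
    'steps:' line, e.g. 'steps:   0%|          | 3/700 [00:20' -> (3, 700).
    Returns None if no such line is present."""
    m = re.search(r"steps:\s*\d+%\|[^|]*\|\s*(\d+)/(\d+)", log_text)
    if m:
        return int(m.group(1)), int(m.group(2))
    return None
```

A resume that restored the counter should report a first step near the saved one (500 here), not 3 as in the log above.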
