Code: Select all
from unsloth import FastLanguageModel

Then I load the Llama 3 model:

Code: Select all
# max_seq_length is defined earlier in the script (not shown here)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

I run my script from VS Code, and my Python environment and the script itself are on WSL. My system information is shown below:

Code: Select all
==((====))== Unsloth: Fast Llama patching release 2024.5
\\ /| GPU: NVIDIA GeForce RTX 4090. Max memory: 23.988 GB. Platform = Linux.
O^O/ \_/ \ Pytorch: 2.1.0+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\ / Bfloat16 = TRUE. Xformers = 0.0.22.post7. FA = False.
"-____-" Free Apache license: http://github.com/unslothai/unsloth

Now I run into this error:

Code: Select all
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[4], line 2
1 # 2. Load Llama3 model
----> 2 model, tokenizer = FastLanguageModel.from_pretrained(
3 model_name = "unsloth/llama-3-8b-bnb-4bit",
4 max_seq_length = max_seq_length,
5 dtype = None,
6 load_in_4bit = True,
7 )
File ~/miniconda/envs/llama3/lib/python3.9/site-packages/unsloth/models/loader.py:142, in FastLanguageModel.from_pretrained(model_name, max_seq_length, dtype, load_in_4bit, token, device_map, rope_scaling, fix_tokenizer, trust_remote_code, use_gradient_checkpointing, resize_model_vocab, *args, **kwargs)
139 tokenizer_name = None
140 pass
--> 142 model, tokenizer = dispatch_model.from_pretrained(
143 model_name = model_name,
144 max_seq_length = max_seq_length,
145 dtype = dtype,
146 load_in_4bit = load_in_4bit,
147 token = token,
148 device_map = device_map,
149 rope_scaling = rope_scaling,
150 fix_tokenizer = fix_tokenizer,
151 model_patcher = dispatch_model,
152 tokenizer_name = tokenizer_name,
...
96 "You have a version of `bitsandbytes` that is not compatible with 4bit inference and training"
97 " make sure you have the latest version of `bitsandbytes` installed"
98 )
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
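
From the message I understand that either every module has to stay on the GPU via a custom device_map, or CPU offload has to be enabled as in the linked Transformers docs. Is something like the following what it means? This is only a rough sketch: device_map does appear in the from_pretrained signature above, but the {"": 0} mapping (force everything onto GPU 0) is my guess, and I am not sure unsloth forwards the offload flag at all.

Code: Select all
# Sketch only: force all modules onto GPU 0 so nothing is dispatched to CPU/disk.
# device_map is a from_pretrained parameter (see the traceback above);
# the {"": 0} mapping comes from the accelerate/transformers docs, not from unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
    device_map = {"": 0},
)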

Would someone please help?