Anonymous
SentencePiece produces no model files after preprocessing (solved)
Post
by Anonymous » 06 Apr 2025, 20:55
So this is the log I see in the terminal:
Code: Select all
sentencepiece_trainer.cc(78) LOG(INFO) Starts training with :
trainer_spec {
input: C:\Users\xxxx\OneDrive\Documents\Projects\py\xxxxx\data\tokenizer\final_text_corpus.txt
input_format:
model_prefix: C:\Users\xxxx\OneDrive\Documents\Projects\py\xxxxxxxx\tokenizer\multilingual_unigram
model_type: UNIGRAM
vocab_size: 50000
self_test_sample_size: 0
character_coverage: 1
input_sentence_size: 10000000
shuffle_input_sentence: 1
seed_sentencepiece_size: 1000000
shrinking_factor: 0.75
max_sentence_length: 16384
num_threads: 16
num_sub_iterations: 2
max_sentencepiece_length: 16
split_by_unicode_script: 1
split_by_number: 1
split_by_whitespace: 1
split_digits: 0
pretokenization_delimiter:
treat_whitespace_as_suffix: 0
allow_whitespace_only_pieces: 0
user_defined_symbols:
required_chars:
byte_fallback: 0
vocabulary_output_piece_score: 1
train_extremely_large_corpus: 1
seed_sentencepieces_file:
hard_vocab_limit: 1
use_all_vocab: 0
unk_id: 1
bos_id: 2
eos_id: 3
pad_id: 0
unk_piece: <unk>
bos_piece: <s>
eos_piece: </s>
pad_piece: <pad>
unk_surface: ⁇
enable_differential_privacy: 0
differential_privacy_noise_level: 0
differential_privacy_clipping_threshold: 0
}
normalizer_spec {
name: nmt_nfkc
add_dummy_prefix: 1
remove_extra_whitespaces: 1
escape_whitespaces: 1
normalization_rule_tsv:
}
denormalizer_spec {}
trainer_interface.cc(353) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(185) LOG(INFO) Loading corpus: C:\Users\xxxxxxx\OneDrive\Documents\Projects\py\xxxxxxx\data\tokenizer\final_text_corpus.txt
trainer_interface.cc(147) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(147) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(147) LOG(INFO) Loaded 3000000 lines
trainer_interface.cc(147) LOG(INFO) Loaded 4000000 lines
trainer_interface.cc(147) LOG(INFO) Loaded 5000000 lines
trainer_interface.cc(124) LOG(WARNING) Too many sentences are loaded! (5816781), which may slow down training.
trainer_interface.cc(126) LOG(WARNING) Consider using --input_sentence_size= and --shuffle_input_sentence=true.
trainer_interface.cc(129) LOG(WARNING) They allow to randomly sample sentences from the entire corpus.
trainer_interface.cc(409) LOG(INFO) Loaded all 5816781 sentences
trainer_interface.cc(425) LOG(INFO) Adding meta_piece:
trainer_interface.cc(425) LOG(INFO) Adding meta_piece:
trainer_interface.cc(425) LOG(INFO) Adding meta_piece:
trainer_interface.cc(425) LOG(INFO) Adding meta_piece:
trainer_interface.cc(425) LOG(INFO) Adding meta_piece:
trainer_interface.cc(430) LOG(INFO) Normalizing sentences...
trainer_interface.cc(539) LOG(INFO) all chars count=731130164
trainer_interface.cc(550) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(560) LOG(INFO) Alphabet size=1280
trainer_interface.cc(561) LOG(INFO) Final character coverage=1
trainer_interface.cc(592) LOG(INFO) Done! preprocessed 5816741 sentences.
After that, the terminal closes.
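Since the window disappears before I can read any error, one way to keep whatever gets printed is to start the training script from a small wrapper that pipes stdout and stderr into a file and reports the exit code. This is only a sketch: the names check_train.py, train_tokenizer.py and train_console.log are placeholders of mine, not names from the project.

Code: Select all

# check_train.py - hypothetical wrapper; runs the training script as a child
# process, keeps its stdout and stderr in one file, and prints the exit code,
# so a silent crash becomes visible.
import subprocess
import sys

with open("train_console.log", "w", encoding="utf-8") as log:
    result = subprocess.run(
        # -X faulthandler makes Python dump a traceback even on a hard crash
        # inside a C extension such as sentencepiece.
        [sys.executable, "-X", "faulthandler", "train_tokenizer.py"],
        stdout=log,
        stderr=subprocess.STDOUT,
    )

# A negative exit code (killed by a signal) or 3221225477 (0xC0000005, an access
# violation on Windows) points to a native crash rather than a Python exception.
print("exit code:", result.returncode)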
This is the script I am running:

Code: Select all

import sys
import os
from pathlib import Path
import sentencepiece as spm
# === Paths ===
root_dir = Path(__file__).resolve().parent.parent
input_path = root_dir / "data" / "tokenizer" / "final_text_corpus.txt"
output_dir = root_dir / "tokenizer"
output_dir.mkdir(parents=True, exist_ok=True)
model_prefix = "spm_tokenizer"
log_path = output_dir / "training.log"
# === Logging setup ===
with open(log_path, "w", encoding="utf-8") as log_file:
    sys.stdout = log_file
    sys.stderr = log_file
    print("Starting tokenizer training...")
    print(f"Input corpus: {input_path}")
    print(f"Output prefix: {model_prefix}")
    print(f"Vocab size: {50000}")

    # === Train Tokenizer ===
    spm.SentencePieceTrainer.train(
        input=str(input_path),
        model_prefix=model_prefix,
        vocab_size=50000,
        model_type="unigram",
        character_coverage=1.0,
        pad_id=0,
        unk_id=1,
        bos_id=2,
        eos_id=3,
        user_defined_symbols=[""],
        train_extremely_large_corpus=True,
        input_sentence_size=10_000_000,
        shuffle_input_sentence=True,
        max_sentence_length=16384
    )
    print(f"Tokenizer trained! Model saved to: {model_prefix}.model / .vocab")
I have already set the train_extremely_large_corpus flag to True and added input_sentence_size and max_sentence_length.
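To narrow down whether the silent exit depends on the vocabulary size or on the amount of data, a scaled-down run of the same trainer call makes a quick smoke test. Everything below is a sketch with values I picked for the test (8000 pieces, a 500000-sentence sample, the prefix spm_smoke_test), not settings from the run above:

Code: Select all

# smoke_test.py - scaled-down sanity check; all sizes and names here are my own picks.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/tokenizer/final_text_corpus.txt",  # same corpus as above
    model_prefix="spm_smoke_test",                 # placeholder prefix
    model_type="unigram",
    vocab_size=8000,                  # much smaller than 50000
    character_coverage=0.9995,        # drop the rarest characters for the test
    input_sentence_size=500_000,      # sample instead of all 5.8M sentences
    shuffle_input_sentence=True,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
)

# If spm_smoke_test.model / .vocab are written and load, the pipeline itself is fine.
sp = spm.SentencePieceProcessor()
sp.load("spm_smoke_test.model")
print(sp.encode("quick smoke test", out_type=str))

If the small model is written but the full run still exits right after "Done! preprocessed ... sentences.", the unigram EM phase with the full 50000-piece vocabulary is probably running out of memory, which would match a window that simply closes without a Python traceback.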