I am building my first neural network / artificial intelligence. The code runs successfully without errors (that I can see), but unfortunately validation and overall accuracy are capped at ~37.5%. I have also tried larger batch sizes (> 128) and more epochs / training for a longer time, but without success. The project is "Using RNNs/CNNs to improve speech recognition in hospital environments"; I want to train the model to filter out hospital and machine noise while also improving its ability to recognise different accents and to distinguish between medical words such as 'hypo' and 'hyper'.
I am using a public Google audio dataset that contains 105,000 audio files covering 35 words. So far I have only wired this dataset into the code, but I want to add hospital-environment background noise and more "medical" words.
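I have not written the noise augmentation yet, but below is a minimal sketch of what I have in mind for mixing a background-noise recording into a speech clip. The noise file name and the SNR value are placeholders, and I am assuming the same librosa/numpy stack used in the rest of the code:

import numpy as np
import librosa

def mix_noise(speech_path, noise_path, snr_db=10.0, sr=8000):
    # load both clips at the same sample rate used for the MFCCs
    speech, _ = librosa.load(speech_path, sr=sr)
    noise, _ = librosa.load(noise_path, sr=sr)
    # loop the noise if it is shorter than the speech clip, then trim to the same length
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    # scale the noise so the mixture has roughly the requested signal-to-noise ratio
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# hypothetical usage with a hospital-noise file I would record myself:
# noisy = mix_noise('some_word_clip.wav', 'hospital_noise.wav', snr_db=5.0)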
The link for the dataset (code and video) is here:
https://www.tensorflow.org/datasets/cat ... h_commands
This is the code I have in its entirety; I run it in Google Colab. If anyone has suggestions or recommendations to point me in the right direction, I would sincerely appreciate it!
(I use pip to install a variety of packages, but I won't include that here.)
(I also import a list of packages, which I am not including here for space.)
dataset_path = '/content/drive/MyDrive/RH/speech_command_dataset.zip'
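For reference, these are roughly the imports the snippets below rely on; I reconstructed this list from the code rather than copying my exact import cell, so treat it as an assumption:

import os
import random
import zipfile
from os import listdir
from os.path import isdir, join

import numpy as np
import librosa
import python_speech_features
import matplotlib.pyplot as plt
from playsound import playsound

import tensorflow as tf
from tensorflow.keras import layers, models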
with zipfile.ZipFile(dataset_path, 'r') as zip_ref:
    zip_ref.extractall('extracted_dataset')
dataset_path = '/content/extracted_dataset/speech_command_dataset'
for name in listdir(dataset_path):
    if isdir(join(dataset_path, name)):
        print(name)
all_targets = [name for name in listdir(dataset_path) if isdir(join(dataset_path, name))]
print(all_targets)
all_targets.remove('_background_noise_')
print(all_targets)
num_samples = 0
for target in all_targets:
    print(len(listdir(join(dataset_path, target))))
    num_samples += len(listdir(join(dataset_path, target)))
print('Total samples:', num_samples)
#settings for the samples #
target_list = all_targets
print(target_list)
feature_sets_file = 'all_targets_mfcc_sets.npz'
perc_keep_samples = 1.0 #1.0 is all samples #
val_ratio = 0.3 # now 30%
test_ratio = 0.3
sample_rate = 8000 # original is 16 kHz (16000 Hz)
num_mfcc = 16
len_mfcc = 16
filenames = []
y = []
for index, target in enumerate(target_list):
    print(join(dataset_path, target))
    filenames.append(listdir(join(dataset_path, target)))
    y.append(np.ones(len(filenames[index])) * index)
print(y)
for item in y:
    print(len(item))
filenames = [item for sublist in filenames for item in sublist]
y = [item for sublist in y for item in sublist]
filenames_y = list(zip(filenames, y))
random.shuffle(filenames_y)
filenames, y = zip(*filenames_y)
print(len(filenames))
filenames = filenames[:int(len(filenames) * perc_keep_samples)]
print(len(filenames))
val_set_size = int(len(filenames) * val_ratio)
test_set_size = int(len(filenames) * test_ratio)
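# NOTE: the actual split into validation/test/train sets is not shown above; the lines
# below are an assumed reconstruction, since the later code uses filenames_train,
# filenames_val, filenames_test and the matching y_orig_* lists.
filenames_val = filenames[:val_set_size]
filenames_test = filenames[val_set_size:(val_set_size + test_set_size)]
filenames_train = filenames[(val_set_size + test_set_size):]
y_orig_val = y[:val_set_size]
y_orig_test = y[val_set_size:(val_set_size + test_set_size)]
y_orig_train = y[(val_set_size + test_set_size):]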
# below: take a small slice of the signal and Fourier transform it #
# apply a mel filter bank to the spectrum (~26 filters for voice) #
# take the log of each filter energy and then apply a discrete cosine transform to get the MFCC values #
# this model will use 16 coefficients; typically 12 are used #
# the result is an array / image we can display #
# y axis is the MFCC coefficient, x axis is time #
# the colours represent MFCC values that the model will be able to differentiate #
def calc_mfcc(path):
    signal, fs = librosa.load(path, sr=sample_rate)
    mfccs = python_speech_features.base.mfcc(signal,
                                             samplerate=fs,
                                             winlen=0.256,
                                             winstep=0.050,
                                             numcep=num_mfcc,
                                             nfilt=26,
                                             nfft=2048,          # dependent on the size of the window #
                                             preemph=0.0,        # disabled #
                                             ceplifter=0,        # helps with noise, but disabled right now #
                                             appendEnergy=False,
                                             winfunc=np.hanning) # prevents unwanted artifacts from the Fourier transform #
    return mfccs.transpose()
# taking the first 2000 samples #
prob_cnt = 0
x_test = []
y_test = []
for index, filename in enumerate(filenames_train):
    if index >= 2000:
        break
    path = join(dataset_path, target_list[int(y_orig_train[index])],
                filename)
    mfccs = calc_mfcc(path)
    if mfccs.shape[1] == len_mfcc:
        x_test.append(mfccs)
        y_test.append(y_orig_train[index])
    else:
        print('Dropped:', index, mfccs.shape)
        prob_cnt += 1
# each audio file should produce a 16x16 MFCC array; if it doesn't, there is something wrong with it #
# therefore we drop it / cannot use it #
print('% of problematic samples:', prob_cnt / 2000)
# around 8% of the audio samples we have are not usable / corrupt #
# the next part of the code is testing some samples by plotting the MFCC image #
idx = 4 # adjust this value to pick a different sample; the word it belongs to is looked up via y_orig_train below #
path = join(dataset_path, target_list[int(y_orig_train[idx])],
            filenames_train[idx])
print("File path:", path)
if os.path.exists(path):
    print("File exists.")
else:
    print("File not found at the specified path.")
# creating the mfccs #
mfccs = calc_mfcc(path)
print('MFCCs:', mfccs)
# plotting the mfccs #
fig = plt.figure()
plt.imshow(mfccs, cmap = 'inferno', origin = 'lower')
# below plays the problem sounds #
print(target_list[int(y_orig_train[idx])])
try:
    playsound(path)
except Exception as e:
    print(f"Error playing sound: {e}")
    print("Ensure the file format is supported by playsound (primarily WAV) and that you have the necessary dependencies (e.g., pygobject for MP3 playback).")
    print("If the file is not a WAV file, consider converting it to WAV using a tool like ffmpeg.")
[image: MFCC plot of the selected sample]
# keeping only the mfccs with the desired/optimal lengths #
def extract_features(in_files, in_y):
    prob_cnt = 0
    out_x = []
    out_y = []
    for index, filename in enumerate(in_files):
        path = join(dataset_path, target_list[int(in_y[index])],
                    filename)
        if not path.endswith('.wav'):
            continue
        mfccs = calc_mfcc(path)
        if mfccs.shape[1] == len_mfcc:
            out_x.append(mfccs)
            out_y.append(in_y[index])
        else:
            print('Dropped:', index, mfccs.shape)
            prob_cnt += 1
    return out_x, out_y, prob_cnt
# we are now removing invalid data #
x_train, y_train, prob = extract_features(filenames_train,
y_orig_train)
print('Removed percentage:', prob / len(y_orig_train))
x_val, y_val, prob = extract_features(filenames_val, y_orig_val)
print('Removed percentage:', prob / len(y_orig_val))
x_test, y_test, prob = extract_features(filenames_test, y_orig_test)
print('Removed percentage:', prob / len(y_orig_test))
# this keeps our saved data and labels #
np.savez(feature_sets_file,
x_train=x_train,
y_train=y_train,
x_val=x_val,
y_val=y_val,
x_test=x_test,
y_test=y_test)
feature_sets = np.load(feature_sets_file)
feature_sets.files
# to call this just do numpy.load #
len(feature_sets['x_train'])
print(feature_sets['y_val'])
feature_sets_path = '/content'
feature_sets_filename = 'all_targets_mfcc_sets.npz'
model_filename = 'speech_recognition_model.h5'
feature_sets = np.load(join(feature_sets_path, feature_sets_filename))
print(feature_sets.files)
x_train = feature_sets['x_train']
y_train = feature_sets['y_train']
x_val = feature_sets['x_val']
y_val = feature_sets['y_val']
x_test = feature_sets['x_test']
y_test = feature_sets['y_test']
# Reshape the input data before calculating sample_shape
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
x_val = x_val.reshape(x_val.shape[0], x_val.shape[1], x_val.shape[2], 1)
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], x_test.shape[2], 1)
# Now calculate sample_shape after reshaping
sample_shape = x_test.shape[1:]
model = models.Sequential()
model.add(layers.Conv2D(64,  # 32 nodes originally, now 64 #
                        (2, 2),
                        activation='relu',  # sets negative numbers to 0 and adds a level of nonlinearity #
                        input_shape=sample_shape))
model.add(layers.MaxPooling2D(pool_size=(2, 2))) # reduces size of above feature maps by down sampling #
# helps the neural network to identify where features are supposed to approximately be in the image #
model.add(layers.Conv2D(64, (2, 2), activation='relu')) # repeating these steps #
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Conv2D(128, (2, 2), activation='relu')) # final layer has 128 nodes instead of 64 #
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
# these extracted features are then input to the neural network which classifies the image based on the info provided #
# expects a vector / 1D input #
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu')) # 64 nodes
model.add(layers.Dropout(0.5)) # randomly removes inputs to the next layer
model.add(layers.Dense(1, activation='sigmoid')) # outputs only 1 node which is a prediction #
model.summary()
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
history = model.fit(x_train,
                    y_train,
                    epochs=50,
                    batch_size=32,
                    validation_data=(x_val, y_val))
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
[images: training vs. validation accuracy plot and training vs. validation loss plot]
tf.keras.models.save_model(model, 'speech_recognition_model.keras')
model_filename = 'speech_recognition_model.keras'
model = models.load_model(model_filename)
for i in range(100, 110):
    print('Answer:', y_test[i], ' Prediction:', model.predict(np.expand_dims(x_test[i], 0)))
# Evaluate model with test set #
model.evaluate(x=x_test, y=y_test)
I have tried to add layers to the CNN.
I checked the data extraction to ensure that was working, which I believe it is.
I tried increasing the number of nodes within the model.
I tried extending the model's training time and increasing the number of epochs.
I tried using a smaller set of data in case the original dataset is too large (no effect).
I expected the accuracy of the model not to stagnate at the ~37% mark, but no matter how much I train it, the accuracy doesn't change.
I hope to have a basic model that can differentiate between the 35 words in the dataset, so that I can then add more words and background noise to continuously train and improve the model.
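One thing I am unsure about: since there are 35 target words, should the final layer be a Dense layer with one output per word and a softmax, trained with a categorical loss, instead of the single sigmoid output with binary cross-entropy I have above? A minimal sketch of that variant (using the same layers/models imports, sample_shape, target_list, and integer labels as above; this is just what I mean, not something I have verified fixes the problem):

num_classes = len(target_list)  # 35 words

model = models.Sequential()
model.add(layers.Conv2D(64, (2, 2), activation='relu', input_shape=sample_shape))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Conv2D(64, (2, 2), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Conv2D(128, (2, 2), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(num_classes, activation='softmax'))  # one output per word

model.compile(loss='sparse_categorical_crossentropy',  # labels stay as integer word indices
              optimizer='rmsprop',
              metrics=['acc'])

history = model.fit(x_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_data=(x_val, y_val))

# a prediction would then be read off with argmax:
# pred_idx = np.argmax(model.predict(np.expand_dims(x_test[0], 0)), axis=1)[0]
# print(target_list[pred_idx])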
Is there any way to improve the accuracy of my CNN?