Most of this article follows https://www.kaggle.com/fizzbuzz/beginner-s-guide-to-audio-data/notebook. It organizes my notes from working through that notebook and adapts the code to the order in which I studied it.
Introduction
This article works through a Kaggle competition,
Freesound General-Purpose Audio Tagging Challenge: Can you automatically recognize sounds from a wide range of real-world environments?
Much like image classification, the task is to decide whether a clip is a drum, a baby's laugh, a blower, and so on. Unlike images, audio is not as easy to inspect visually and is a sequence that evolves over time, so let's see how audio data can be handled.
Downloading the dataset
The dataset can be downloaded from https://www.kaggle.com/c/freesound-audio-tagging/data. It is fairly large (about 7 GB) and requires a logged-in Kaggle account, so one option is to start the download in the browser, copy the link, and hand it to a download manager such as Thunder (Xunlei). The data consists of:
- train.csv
Maps each wav file name to its label and records whether that label was manually verified, roughly as follows:
fname,label,manually_verified
00044347.wav,Hi-hat,0
001ca53d.wav,Saxophone,1
002d256b.wav,Trumpet,0
0033e230.wav,Glockenspiel,1
00353774.wav,Cello,1
003b91e8.wav,Cello,0
003da8e5.wav,Knock,1
0048fd00.wav,Gunshot_or_gunfire,1
004ad66f.wav,Clarinet,0
- sample_submission.csv
Since submissions are scored on top-3 accuracy, each row of the submission contains three predicted labels.
fname,label
00063640.wav,Laughter Hi-Hat Flute
0013a1db.wav,Laughter Hi-Hat Flute
002bb878.wav,Laughter Hi-Hat Flute
002d392d.wav,Laughter Hi-Hat Flute
00326aa9.wav,Laughter Hi-Hat Flute
0038a046.wav,Laughter Hi-Hat Flute
003995fa.wav,Laughter Hi-Hat Flute
005ae625.wav,Laughter Hi-Hat Flute
007759c4.wav,Laughter Hi-Hat Flute
- train.zip, test.zip
These two archives contain the wav files referenced by the two csv files above.
Version 0: building a classification CNN in the time domain
Since the audio-processing pipeline was new to me, the focus here is on the structure of the code.
First, import the required libraries:
import librosa
import numpy as np
import scipy
from keras import losses, models, optimizers
from keras.activations import relu, softmax
from keras.callbacks import (EarlyStopping, LearningRateScheduler,
ModelCheckpoint, TensorBoard, ReduceLROnPlateau)
from keras.layers import (Convolution1D, Dense, Dropout, GlobalAveragePooling1D,
GlobalMaxPool1D, Input, MaxPool1D, concatenate)
from keras.utils import Sequence, to_categorical
First we need to settle the following:
Network architecture
Working in the time domain means a one-dimensional input sequence, so a 1D CNN is used. The input is fixed at 2 seconds of audio sampled at 16000 Hz, but the clips vary in length, so we have to decide how to bring every clip to that fixed length.
Building the data
First, list the configuration constants we need:
- Sampling rate
- Duration of audio fed to the network
- Number of output classes
- Number of training epochs
- Learning rate
So we write a configuration class:
class Config(object):
    def __init__(self, sampling_rate=16000, audio_duration=2, n_classes=41,
                 learning_rate=0.0001, max_epochs=50):
        self.sampling_rate = sampling_rate
        self.audio_duration = audio_duration
        self.n_classes = n_classes
        self.learning_rate = learning_rate
        self.max_epochs = max_epochs
        # Number of samples fed to the network
        self.audio_length = self.sampling_rate * self.audio_duration
        # Shape of the network Input
        self.dim = (self.audio_length, 1)
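As a quick sanity check of the defaults (a small sketch): two seconds at 16 kHz is 32000 samples, so the network input has shape (32000, 1).

# With the defaults above, the model consumes 2 s * 16000 Hz = 32000 samples, one channel.
cfg = Config()
print(cfg.audio_length, cfg.dim)  # 32000 (32000, 1)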
Next comes producing batches from the wav files; the following draws on https://blog.csdn.net/m0_37477175/article/details/79716312.
When training a Keras model, the usual approach is to load all training data into memory and feed it to the network, but that stops working when memory is limited and the dataset is large, so we build a data generator instead. From the official Keras API:
Every Sequence must implement the __getitem__ and the __len__ methods. If you want to modify your dataset between epochs you may implement on_epoch_end. The method __getitem__ should return a complete batch.
class DataGenerator(Sequence):
    def __init__(self, config, data_dir, list_IDs, labels=None,
                 batch_size=64, preprocessing_fn=lambda x: x):
        self.config = config
        self.data_dir = data_dir
        # Changed from the original code; see the pandas discussion later for why list() is added
        self.list_IDs = list(list_IDs)
        if labels is None:
            self.labels = None
        else:
            self.labels = list(labels)
        self.batch_size = batch_size
        self.preprocessing_fn = preprocessing_fn
        # on_epoch_end records how many wav files there are and builds the position-to-ID mapping;
        # Keras also calls it again at the end of every training epoch
        self.on_epoch_end()
        self.dim = self.config.dim

    # The following two methods must be implemented in every Sequence subclass
    # Return the total number of batches
    def __len__(self):
        return int(np.ceil(len(self.list_IDs) / self.batch_size))

    # Return the contents of the index-th batch
    def __getitem__(self, index):
        # The contents of one batch are given by indexes, e.g. [a, a+1, a+2, ..., a+batch_size-1]
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Map indexes to IDs, producing the list of files to process
        # Note: if list_IDs were still a pandas object, [k] would look up by the original table's index label
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        return self.__data_generation(list_IDs_temp)

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.list_IDs))

    def __data_generation(self, list_IDs_temp):
        cur_batch_size = len(list_IDs_temp)
        # self.dim is shape-like (multi-dimensional), so * unpacks it into the argument list
        X = np.empty((cur_batch_size, *self.dim))
        input_length = self.config.audio_length
        for i, ID in enumerate(list_IDs_temp):
            file_path = self.data_dir + ID
            # Read and resample the audio
            data, _ = librosa.core.load(file_path, sr=self.config.sampling_rate,
                                        res_type='kaiser_fast')
            # Random offset / padding
            # If the clip is too long, cut out a segment of the required length,
            # e.g. len(data)=5000, input_length=1000
            if len(data) > input_length:
                # max_offset=4000
                max_offset = len(data) - input_length
                # e.g. offset=3214
                offset = np.random.randint(max_offset)
                # data[3214:1000+3214]
                data = data[offset:(input_length+offset)]
            else:
                # If the clip is too short, pad it to the required length
                if input_length > len(data):
                    max_offset = input_length - len(data)
                    offset = np.random.randint(max_offset)
                else:
                    offset = 0
                data = np.pad(data, (offset, input_length - len(data) - offset), "constant")
            # Normalization + other preprocessing
            # preprocessing_fn normalizes the audio, see the next code block
            # Add a new axis to match the network's Input shape
            data = self.preprocessing_fn(data)[:, np.newaxis]
            # The i-th row of X is the processed clip
            X[i,] = data
        # With labels (training)
        if self.labels is not None:
            y = np.empty(cur_batch_size, dtype=int)
            for i, ID in enumerate(list_IDs_temp):
                # Look each label up by the ID's position in the full list
                # (indexing with the batch-local i would pick the wrong labels)
                y[i] = self.labels[self.list_IDs.index(ID)]
            return X, to_categorical(y, num_classes=self.config.n_classes)
        # Without labels (test)
        else:
            return X
We also need a function to normalize the audio:
def audio_norm(data):
    max_data = np.max(data)
    min_data = np.min(data)
    data = (data - min_data) / (max_data - min_data + 1e-6)
    return data - 0.5
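A quick check on a toy array (values chosen arbitrarily): the output is min-max scaled and shifted to roughly the range [-0.5, 0.5].

# Toy example: the minimum maps to -0.5 and the maximum to (almost) 0.5.
x = np.array([-0.2, 0.0, 0.6], dtype=np.float32)
print(audio_norm(x))  # approximately [-0.5, -0.25, 0.5]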
Building the model
def get_1d_conv_model(config):
    nclass = config.n_classes
    input_length = config.audio_length

    inp = Input(shape=(input_length, 1))
    x = Convolution1D(16, 9, activation=relu, padding="valid")(inp)
    x = Convolution1D(16, 9, activation=relu, padding="valid")(x)
    x = MaxPool1D(16)(x)
    x = Dropout(rate=0.1)(x)

    x = Convolution1D(32, 3, activation=relu, padding="valid")(x)
    x = Convolution1D(32, 3, activation=relu, padding="valid")(x)
    x = MaxPool1D(4)(x)
    x = Dropout(rate=0.1)(x)

    x = Convolution1D(32, 3, activation=relu, padding="valid")(x)
    x = Convolution1D(32, 3, activation=relu, padding="valid")(x)
    x = MaxPool1D(4)(x)
    x = Dropout(rate=0.1)(x)

    x = Convolution1D(256, 3, activation=relu, padding="valid")(x)
    x = Convolution1D(256, 3, activation=relu, padding="valid")(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(rate=0.2)(x)

    x = Dense(64, activation=relu)(x)
    x = Dense(1028, activation=relu)(x)
    out = Dense(nclass, activation=softmax)(x)

    model = models.Model(inputs=inp, outputs=out)
    opt = optimizers.Adam(config.learning_rate)
    model.compile(optimizer=opt, loss=losses.categorical_crossentropy, metrics=['acc'])
    return model
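Optionally, instantiate the model once to inspect the layer shapes before training (a sketch using the default Config):

# Print the layer stack and parameter count; useful for checking that the conv/pool
# stages reduce the 32000 time steps to a small feature map before GlobalMaxPool1D.
get_1d_conv_model(Config()).summary()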
Training
# Load the data
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('sample_submission.csv')
# Use the wav file name as the index
train.set_index("fname", inplace=True)
test.set_index("fname", inplace=True)
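The code below also refers to LABELS (the list of class names) and train.label_idx (the integer-encoded label), which are not built anywhere above. Following the original kernel, they can be constructed like this:

# Build the class list and an integer index per label; both are used later for
# the training targets and for turning predictions back into label names.
LABELS = list(train.label.unique())
label_idx = {label: i for i, label in enumerate(LABELS)}
train['label_idx'] = train.label.apply(lambda x: label_idx[x])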
Next, count how many samples each wav file contains (fname is now the index, so iterate over it directly):
import wave
train['nframes'] = [wave.open('audio_train/' + f).getnframes()
                    for f in train.index]
test['nframes'] = [wave.open('audio_test/' + f).getnframes()
                   for f in test.index]
Now create a Config and set up the data generator we defined earlier:
config = Config(sampling_rate=16000, audio_duration=2, max_epochs=5)  # Version 0's Config takes no n_folds argument
train_generator = DataGenerator(config=config, data_dir='audio_train/',
                                list_IDs=train.index, labels=train.label_idx,
                                batch_size=64, preprocessing_fn=audio_norm)
Build the model and start training:
model = get_1d_conv_model(config)
history = model.fit_generator(train_generator, epochs=config.max_epochs,
                              use_multiprocessing=True, workers=6, max_queue_size=20)
Then we can generate predictions. Here is what the final output looks like:
# Save test predictions
test_generator = DataGenerator(config, 'audio_test/', test.index, batch_size=128,
                               preprocessing_fn=audio_norm)
predictions = model.predict_generator(test_generator, use_multiprocessing=True,
                                      workers=6, max_queue_size=20, verbose=1)
# Make a submission file
# predictions.shape == (len(test), 41)
top_3 = np.array(LABELS)[np.argsort(-predictions, axis=1)[:, :3]]
predicted_labels = [' '.join(list(x)) for x in top_3]
test['label'] = predicted_labels
# The double brackets return a DataFrame (with the label header and the index)
test[['label']].to_csv("predictions.csv")
test[['label']].head()
The program above is still quite crude: it uses only a training set and a test set, and the test predictions go straight to Kaggle. Without a scoring mechanism of our own we cannot tell how good the model really is, so next we add a validation set and a few other measures.
Version 1: adding checkpointing and cross-validation
To compare models more reliably, we build on Version 0 by splitting the training data into 10 folds, training a model on each split and seeing which performs best, and we add a few Keras callbacks, which do the following:
Callback
We use some Keras callbacks to monitor the training.
- ModelCheckpoint saves the best weights of our model (using validation data). We use these weights to make test predictions.
- EarlyStopping stops the training once validation loss ceases to decrease.
- TensorBoard helps us visualize training and validation loss and accuracy.
A pandas pitfall to watch for
a=pd.DataFrame({'A':['a','b','c'],'B':[11,12,13],'C':['h','e','h']})
print(a.head())
which prints
   A   B  C
0  a  11  h
1  b  12  e
2  c  13  h
Now take a slice of it and read the second element of that slice.
b=a[1:]
b['B'][1]
You might expect the second element of b to be 13, but the result is 12. The slice does not give b a fresh index: b still carries a's original integer index labels, so b['B'][1] is a label lookup and retrieves the element labelled index=1 in the original table a.
- The fix used in the original Kaggle kernel is to set a different index, removing the original numeric index, so that this kind of lookup behaves as expected.
a=pd.DataFrame({'A':['a','b','c'],'B':[11,12,13],'C':['h','e','h']})
a.set_index('A',inplace=True)
print(a.head())
    B  C
A
a  11  h
b  12  e
c  13  h
- Another option is to drop the pandas index entirely by converting the column with list(), so that b no longer carries a's labels: list(b['B'])[1] also returns the expected 13.
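For completeness, positional indexing sidesteps the label lookup as well:

# .iloc indexes by position rather than by the inherited label, so this prints 13.
print(b['B'].iloc[1])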
Splitting the data
Back to the main thread: why do we care about this pitfall? Version 0 used the whole dataset, but now cross-validation hands DataGenerator only a slice of it; without the fix above, pulling elements out of DataGenerator by index would raise KeyError.
For an explanation of KFold and related utilities see https://blog.csdn.net/FontThrone/article/details/79220. Here we use StratifiedKFold, which keeps the distribution of y in each fold the same as in the full data: if the data contains ten 0s and five 1s, every fold preserves the same 2:1 ratio (a toy check of this property follows the setup code below). The code is as follows:
from sklearn.model_selection import StratifiedKFold
import os
import shutil

PREDICTION_FOLDER = "predictions_1d_conv"
if not os.path.exists(PREDICTION_FOLDER):
    os.mkdir(PREDICTION_FOLDER)
if os.path.exists('logs/' + PREDICTION_FOLDER):
    shutil.rmtree('logs/' + PREDICTION_FOLDER)

# The original kernel wrote skf = StratifiedKFold(train.label_idx, n_splits=config.n_folds),
# which no longer matches the current sklearn API
skf = StratifiedKFold(n_splits=10)
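As promised above, here is a toy check of the stratification property (hypothetical labels, not the competition data):

# Ten 0s and five 1s (2:1); each of the 5 validation folds keeps that ratio.
y_toy = np.array([0]*10 + [1]*5)
X_toy = np.zeros((len(y_toy), 1))     # features are irrelevant to the split itself
for fold, (tr, va) in enumerate(StratifiedKFold(n_splits=5).split(X_toy, y_toy)):
    print(fold, np.bincount(y_toy[va]))   # every fold prints [2 1]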
Then, for each split, train a model and generate predictions (note that this trains 10 models and produces 10 sets of test predictions):
for i, (train_split, val_split) in enumerate(skf.split(train.index, train['label_idx'])):
    # train_split and val_split are positional indices, hence iloc below
    train_set = train.iloc[train_split]
    val_set = train.iloc[val_split]

    checkpoint = ModelCheckpoint('best_%d.h5'%i, monitor='val_loss', verbose=1, save_best_only=True)
    early = EarlyStopping(monitor="val_loss", mode="min", patience=5)
    tb = TensorBoard(log_dir='./logs/' + PREDICTION_FOLDER + '/fold_%d'%i,
                     write_graph=True)
    callbacks_list = [checkpoint, early, tb]

    print("Fold: ", i)
    print("#"*50)
    model = get_1d_conv_model(config)

    train_generator = DataGenerator(config, 'audio_train/', train_set.index,
                                    train_set['label_idx'], batch_size=64,
                                    preprocessing_fn=audio_norm)
    val_generator = DataGenerator(config, 'audio_train/', val_set.index,
                                  val_set['label_idx'], batch_size=64,
                                  preprocessing_fn=audio_norm)
    history = model.fit_generator(train_generator, callbacks=callbacks_list,
                                  validation_data=val_generator,
                                  epochs=config.max_epochs, use_multiprocessing=True,
                                  workers=6, max_queue_size=20)

    # Load the weights of the best model seen on this fold (lowest val_loss)
    model.load_weights('best_%d.h5'%i)

    # Save train predictions
    # Each fold's model also predicts the whole training set, for later analysis
    train_generator = DataGenerator(config, 'audio_train/', train.index, batch_size=128,
                                    preprocessing_fn=audio_norm)
    predictions = model.predict_generator(train_generator, use_multiprocessing=True,
                                          workers=6, max_queue_size=20, verbose=1)
    np.save(PREDICTION_FOLDER + "/train_predictions_%d.npy"%i, predictions)

    # Save test predictions
    # Each fold's model predicts the test set; the results are combined afterwards
    test_generator = DataGenerator(config, 'audio_test/', test.index, batch_size=128,
                                   preprocessing_fn=audio_norm)
    predictions = model.predict_generator(test_generator, use_multiprocessing=True,
                                          workers=6, max_queue_size=20, verbose=1)
    np.save(PREDICTION_FOLDER + "/test_predictions_%d.npy"%i, predictions)

    # Make a submission file
    top_3 = np.array(LABELS)[np.argsort(-predictions, axis=1)[:, :3]]
    predicted_labels = [' '.join(list(x)) for x in top_3]
    test['label'] = predicted_labels
    # The double brackets return a DataFrame (with the label header and the index)
    test[['label']].to_csv(PREDICTION_FOLDER + "/predictions_%d.csv"%i)
Combining the predictions
Finally we combine the 10 sets of test predictions with a geometric mean and submit the result.
pred_list = []
for i in range(10):
    pred_list.append(np.load("predictions_1d_conv/test_predictions_%d.npy"%i))

# Start from an array of ones with the same shape as one prediction matrix
prediction = np.ones_like(pred_list[0])
# Multiply the folds' predicted probabilities together
for pred in pred_list:
    prediction = prediction*pred
# Take the geometric mean
prediction = prediction**(1./len(pred_list))

# Make a submission file
top_3 = np.array(LABELS)[np.argsort(-prediction, axis=1)[:, :3]]
predicted_labels = [' '.join(list(x)) for x in top_3]
test = pd.read_csv('sample_submission.csv')
test['label'] = predicted_labels
test[['fname', 'label']].to_csv("1d_conv_ensembled_submission.csv", index=False)
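Multiplying ten probability matrices can underflow toward zero; an equivalent way to compute the same geometric mean is to average in log space (eps is a small guard I added, not part of the original code):

# Same geometric mean, computed in log space to avoid underflow.
eps = 1e-12
stacked = np.stack(pred_list)                      # shape (n_folds, n_test, n_classes)
prediction = np.exp(np.log(stacked + eps).mean(axis=0))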
Version 2: classifying in the frequency domain with MFCC features
With the two versions above in place, the remaining changes are easy. First, slightly extend the Config class with the MFCC-related settings:
class Config(object):
    def __init__(self, sampling_rate=16000, audio_duration=2, n_classes=41,
                 n_folds=10, learning_rate=0.0001, max_epochs=50,
                 use_mfcc=False, n_mfcc=20):
        self.sampling_rate = sampling_rate
        self.audio_duration = audio_duration
        self.n_classes = n_classes
        self.n_folds = n_folds
        self.learning_rate = learning_rate
        self.max_epochs = max_epochs
        self.n_mfcc = n_mfcc
        self.use_mfcc = use_mfcc
        # Number of samples fed to the network
        self.audio_length = self.sampling_rate * self.audio_duration
        if use_mfcc:
            # MFCCs are not computed per sample but per hop of 512 samples, hence the /512
            self.dim = (self.n_mfcc, 1 + int(np.floor(self.audio_length/512)), 1)
        else:
            self.dim = (self.audio_length, 1)
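A quick shape check (a sketch, matching the positional librosa call used in the code below): librosa's default hop length is 512 samples, which is where the 1 + floor(audio_length / 512) frame count above comes from.

# For 2 s at 16 kHz (32000 samples) librosa produces 1 + 32000//512 = 63 frames.
cfg = Config(sampling_rate=16000, audio_duration=2, use_mfcc=True, n_mfcc=40)
dummy = np.zeros(cfg.audio_length, dtype=np.float32)
mfcc = librosa.feature.mfcc(dummy, sr=cfg.sampling_rate, n_mfcc=cfg.n_mfcc)
print(mfcc.shape, cfg.dim)  # (40, 63) vs (40, 63, 1)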
DataGenerator likewise gains a branch for the use_mfcc=True case:
class DataGenerator(Sequence):
    def __init__(self, config, data_dir, list_IDs, labels=None,
                 batch_size=64, preprocessing_fn=lambda x: x):
        self.config = config
        self.data_dir = data_dir
        self.list_IDs = list(list_IDs)
        if labels is None:
            self.labels = None
        else:
            self.labels = list(labels)
        self.batch_size = batch_size
        self.preprocessing_fn = preprocessing_fn
        # on_epoch_end records how many wav files there are and builds the position-to-ID mapping;
        # Keras also calls it again at the end of every training epoch
        self.on_epoch_end()
        self.dim = self.config.dim

    # The following two methods must be implemented in every Sequence subclass
    # Return the total number of batches
    def __len__(self):
        return int(np.ceil(len(self.list_IDs) / self.batch_size))

    # Return the contents of the index-th batch
    def __getitem__(self, index):
        # The contents of one batch are given by indexes, e.g. [a, a+1, a+2, ..., a+batch_size-1]
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Map indexes to IDs, producing the list of files to process
        # Note: if list_IDs were still a pandas object, [k] would look up by the original table's index label
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        return self.__data_generation(list_IDs_temp)

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.list_IDs))

    def __data_generation(self, list_IDs_temp):
        cur_batch_size = len(list_IDs_temp)
        # self.dim is shape-like (multi-dimensional), so * unpacks it into the argument list
        X = np.empty((cur_batch_size, *self.dim))
        input_length = self.config.audio_length
        for i, ID in enumerate(list_IDs_temp):
            file_path = self.data_dir + ID
            # Read and resample the audio
            data, _ = librosa.core.load(file_path, sr=self.config.sampling_rate,
                                        res_type='kaiser_fast')
            # Random offset / padding
            # If the clip is too long, cut out a segment of the required length,
            # e.g. len(data)=5000, input_length=1000
            if len(data) > input_length:
                # max_offset=4000
                max_offset = len(data) - input_length
                # e.g. offset=3214
                offset = np.random.randint(max_offset)
                # data[3214:1000+3214]
                data = data[offset:(input_length+offset)]
            else:
                # If the clip is too short, pad it to the required length
                if input_length > len(data):
                    max_offset = input_length - len(data)
                    offset = np.random.randint(max_offset)
                else:
                    offset = 0
                data = np.pad(data, (offset, input_length - len(data) - offset), "constant")
            # Normalization + other preprocessing
            if self.config.use_mfcc:
                # The original notebook does not normalize the MFCCs here; I'm not sure why,
                # since MFCCs also carry intensity information. (And how does this relate to
                # the energy feature that Kaldi appends?)
                data = librosa.feature.mfcc(data, sr=self.config.sampling_rate,
                                            n_mfcc=self.config.n_mfcc)
                # Add a channel dimension
                data = np.reshape(data, (*data.shape, 1))
            else:
                # preprocessing_fn normalizes the audio, see audio_norm above
                # Add a new axis to match the network's Input shape
                data = self.preprocessing_fn(data)[:, np.newaxis]
            # The i-th row of X is the processed clip
            X[i,] = data
        # With labels (training)
        if self.labels is not None:
            y = np.empty(cur_batch_size, dtype=int)
            for i, ID in enumerate(list_IDs_temp):
                # Look each label up by the ID's position in the full list
                y[i] = self.labels[self.list_IDs.index(ID)]
            return X, to_categorical(y, num_classes=self.config.n_classes)
        # Without labels (test)
        else:
            return X
Finally, the 2D convolutional model:
# Extra layers used below that are missing from the earlier import list; in Keras 2.x
# they all live in keras.layers
from keras.layers import (Activation, BatchNormalization, Convolution2D,
                          Flatten, MaxPool2D)

def get_2d_conv_model(config):
    nclass = config.n_classes

    inp = Input(shape=config.dim)
    # inp = Input(shape=(config.dim[0], config.dim[1], 1))
    x = Convolution2D(32, (4,10), padding="same")(inp)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPool2D()(x)

    x = Convolution2D(32, (4,10), padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPool2D()(x)

    x = Convolution2D(32, (4,10), padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPool2D()(x)

    x = Convolution2D(32, (4,10), padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPool2D()(x)

    x = Flatten()(x)
    x = Dense(64)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    out = Dense(nclass, activation=softmax)(x)

    model = models.Model(inputs=inp, outputs=out)
    opt = optimizers.Adam(config.learning_rate)
    model.compile(optimizer=opt, loss=losses.categorical_crossentropy, metrics=['acc'])
    return model
Training proceeds as in Version 1, except that Config now carries the MFCC-related settings:
config = Config(sampling_rate=44100, audio_duration=2, n_folds=10,
learning_rate=0.001, use_mfcc=True, n_mfcc=40)
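As a quick check (a sketch), the model input should line up with the MFCC feature shape implied by this config:

# 2 s at 44100 Hz gives 88200 samples and 1 + 88200//512 = 173 MFCC frames.
model = get_2d_conv_model(config)
print(model.input_shape)  # (None, 40, 173, 1) with the settings above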
MFCC normalization
Note that the original notebook computes features for the whole training set in memory, which may or may not exhaust RAM. The earlier time-domain preprocessing normalized each clip's amplitude individually, whereas the original author extracts MFCC features for all of X_train and then normalizes them globally before training. To keep this write-up consistent I stick with the DataGenerator approach and skip the global normalization. For reference, the original author's code is:
def prepare_data(df, config, data_dir):
    X = np.empty(shape=(df.shape[0], config.dim[0], config.dim[1], 1))
    input_length = config.audio_length
    for i, fname in enumerate(df.index):
        print(fname)
        file_path = data_dir + fname
        data, _ = librosa.core.load(file_path, sr=config.sampling_rate, res_type="kaiser_fast")

        # Random offset / Padding
        if len(data) > input_length:
            max_offset = len(data) - input_length
            offset = np.random.randint(max_offset)
            data = data[offset:(input_length+offset)]
        else:
            if input_length > len(data):
                max_offset = input_length - len(data)
                offset = np.random.randint(max_offset)
            else:
                offset = 0
            data = np.pad(data, (offset, input_length - len(data) - offset), "constant")

        data = librosa.feature.mfcc(data, sr=config.sampling_rate, n_mfcc=config.n_mfcc)
        data = np.expand_dims(data, axis=-1)
        X[i,] = data
    return X
X_train = prepare_data(train, config, '../input/freesound-audio-tagging/audio_train/')
X_test = prepare_data(test, config, '../input/freesound-audio-tagging/audio_test/')
y_train = to_categorical(train.label_idx, num_classes=config.n_classes)
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)
X_train = (X_train - mean)/std
X_test = (X_test - mean)/std
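If you stay with the DataGenerator pipeline instead, one per-clip alternative (my own sketch, not the original author's code) is to standardize each MFCC matrix as it is produced, right after the librosa.feature.mfcc call in the use_mfcc branch:

# Hypothetical per-clip normalization: standardize one MFCC matrix to zero mean
# and unit variance; could be applied to `data` inside DataGenerator's MFCC branch.
def mfcc_norm(mfcc):
    return (mfcc - np.mean(mfcc)) / (np.std(mfcc) + 1e-6)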
Results and outlook
In the original author's final runs, combining the time-domain and frequency-domain models gave the best result.
In speech recognition, MFCC features are usually augmented with first- and second-order deltas, energy, and so on; it would be interesting to see whether adding those helps here. The frame length (typically 20 ms to 30 ms, long enough to capture periodicity yet short enough to track the dynamics) is also tuned for human speech, and I suspect that choice matters for this classification task too; something to test later.