I have a dataset of audio files, and I have transformed these audios into MFCC plots like this one:
Now I want to feed my neural network:
import tensorflow as tf
import tensorflow.keras as tfk
import tensorflow.keras.layers as tfkl
cnn_model = tfk.Sequential(name='CNN_model')
cnn_model.add(tfkl.Conv1D(filters= 225, kernel_size= 11, padding='same', activation='relu', input_shape=(4500,9000, 3)))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.Bidirectional(tfkl.GRU(200, activation='relu', return_sequences=True, implementation=0)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())
cnn_model.compile(loss='mae', optimizer='Adam', metrics=['mae'])
cnn_model.summary()
I use a Conv1D because it is the layer typically used for this kind of network. But I don't know how to transform the data from the image into the input of the CNN. I have tried several transformations on my own, but they just didn't work.
As you can see in the picture below, I need to feed the first layer, which is a Conv1D, but I can't because the shape of my image is (4500, 9000, 3). So basically, what I want to do is transform this image into an input for a Conv1D, in the same way as in the image below. This image represents one audio file passed to the NN.
Obviously, when I pass an image with this shape to a Conv1D layer, I get a ValueError:
ValueError: Input 0 of layer conv1d_4 is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [None, 4500, 9000, 3]
I have tried converting my image to greyscale, but that is not the right approach and I lose valuable information.
I think you can convert the image into grayscale, but you risk losing a lot of valuable data.
The best possible approach is to reshape the MFCC spectrogram: img.reshape(4500, 3 * 9000)
Example
# Sample data
>>> a
array([[[1, 1, 1],
        [2, 2, 2]],

       [[3, 3, 3],
        [4, 4, 4]]])
>>> a.shape
(2, 2, 3)

# Reshaping data
>>> a.reshape(2, -1)
array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])

# Or
>>> a.reshape(2, 6)
array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])
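For concreteness, here is a minimal sketch of how that same reshape could be applied to one (4500, 9000, 3) plot before handing it to a Conv1D; the img array below is a random placeholder, not your actual MFCC image:

import numpy as np

# stand-in for one (height, width, channels) MFCC plot image
img = np.random.rand(4500, 9000, 3).astype('float32')

flat = img.reshape(4500, -1)     # fold the 3 channels into the feature axis -> (4500, 27000)
batch = flat[np.newaxis, ...]    # Conv1D expects (batch, steps, features) -> (1, 4500, 27000)
print(batch.shape)

With this layout the Conv1D input_shape would be (4500, 27000) instead of (4500, 9000, 3).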
I feel like you are not looking at this as a typical speech recognition problem, because I see several strange choices in your approach.
If you look at librosa.feature.mfcc, this is what it says:
Returns: M: np.ndarray [shape=(n_mfcc, t)]
So as you can see, there are no channels here. There is the input dimension (n_mfcc) and the time dimension (t). Therefore, you should be able to use a Conv1D directly without any preprocessing.
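As a rough sketch (with a synthetic signal standing in for one of your audio files), the only transformation needed is a transpose plus a batch dimension:

import numpy as np
import librosa

# a one-second synthetic signal stands in for a real recording
sr = 22050
y = np.random.randn(sr).astype(np.float32)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (n_mfcc, t)
x = mfcc.T[np.newaxis, ...]                          # shape (1, t, n_mfcc), ready for a Conv1D
print(x.shape)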
This is what the tail of your model looks like:
cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())
Personally, I haven't seen people using dropout on the last layer, so I would get rid of it: dropout randomly switches off neurons, but you want all your output nodes active at all times.
Usually, CTC is what is used to optimize speech recognition models. I (personally) haven't seen anybody using mae as a loss for a speech model, because your input data and label data usually have misaligned time dimensions. This means there isn't always a label corresponding to each time step of the prediction, and that's where CTC loss shines. That's probably what you want to use for this model (unless you are 100% certain that there is a label for each and every prediction and that they are perfectly aligned).
Having said that, the loss depends on the problem you're solving. But I will include an example of how to use this loss for this problem.
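To make the shape contract of CTC concrete, here is a minimal, self-contained sketch of tf.keras.backend.ctc_batch_cost with dummy numbers; the values are made up and only illustrate that predictions and labels can have different lengths:

import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K

batch, timesteps, n_classes = 2, 10, 28                                   # 27 symbols + 1 CTC blank
y_pred = tf.nn.softmax(tf.random.normal((batch, timesteps, n_classes)))   # fake model output
y_true = np.array([[4, 8, 6, 7, 19], [14, 13, 4, 0, 0]])                  # padded label ids
input_length = np.array([[10], [10]])                                     # prediction length per sample
label_length = np.array([[5], [3]])                                       # transcript length per sample

loss = K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
print(loss.shape)                                                         # (2, 1): one loss value per sample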
To show a working example, I'm going to use this speech dataset. I chose it because I can get a good result quickly, due to the simplicity of the problem.
Then you can perform MFCC on the audio files, and you will get the following heatmap. As I said before, this will be a 2D matrix of size (n_mfcc, timesteps). With the batch dimension it becomes (batch size, n_mfcc, timesteps).
Here's how you can visualize the above. Here, y is an audio signal loaded via the librosa.core.load() function.
import librosa
import librosa.display
import matplotlib.pyplot as plt

# y is the raw waveform and sr the sampling rate of one audio file
y = audios[aid][1][0]
sr = audios[aid][1][1]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
print(mfcc.shape)

plt.figure(figsize=(6, 4))
librosa.display.specshow(mfcc, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()
plt.show()
Next you can create your training and testing data. Here's what I create:

train_data - A (sample size, timesteps, n_mfcc) size array
train_labels - A (sample size, timesteps, num_classes) size array
train_inp_lengths - A (sample size, ) size array (for CTC loss)
train_seq_lengths - A (sample size, ) size array (for CTC loss)

test_data - A (sample size, timesteps, n_mfcc) size array
test_labels - A (sample size, timesteps, num_classes+1) size array
test_inp_lengths - A (sample size, ) size array (for CTC loss)
test_seq_lengths - A (sample size, ) size array (for CTC loss)

I am using the following mapping to convert chars to numbers:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
a_map = {} # map letter to number
rev_a_map = {} # map number to letter
for i, a in enumerate(alphabet):
    a_map[a] = i
    rev_a_map[i] = a
label_map = {0:'zero', 1:'one', 2:'two', 3:'three', 4:'four', 5:'five', 6:'six', 7: 'seven', 8: 'eight', 9:'nine'}
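The answer doesn't show how the label and length arrays are built, so here is a rough, hypothetical sketch of one way to do it with the a_map / label_map defined above; max_label_len and the hard-coded 10 timesteps are assumptions, not values taken from the original code:

import numpy as np
import tensorflow as tf

max_label_len = 10                     # assumed padded transcript length

def encode_word(word, max_len=max_label_len):
    # map characters to ids and pad with spaces up to max_len
    return np.array([a_map[c] for c in word.ljust(max_len)])

words = [label_map[0], label_map[7]]                               # e.g. ['zero', 'seven']
label_ids = np.stack([encode_word(w) for w in words])              # (samples, max_label_len)
train_labels = tf.keras.utils.to_categorical(label_ids, num_classes=len(alphabet))
train_seq_lengths = np.array([len(w) for w in words])              # true transcript lengths
train_inp_lengths = np.full(len(words), 10)                        # timesteps produced by the model
print(train_labels.shape, train_seq_lengths.shape, train_inp_lengths.shape)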
A few things to note:

The mfcc operation returns (n_mfcc, time). You have to do an axis permutation to get it into the (time, n_mfcc) format, so that the convolution happens on the time dimension.
I have changed from the sequential API to the functional API, as I needed to include several input layers to make this work for ctc_loss. Furthermore, I got rid of that last dropout layer.
import tensorflow.keras.backend as K

def ctc_loss(inp_lengths, seq_lengths):
    # wrap Keras' ctc_batch_cost so the per-sample lengths are available inside the loss
    def loss(y_true, y_pred):
        l = tf.reduce_mean(K.ctc_batch_cost(tf.argmax(y_true, axis=-1), y_pred, inp_lengths, seq_lengths))
        return l
    return loss

K.clear_session()
inp = tfk.Input(shape=(10, 50))      # (timesteps, n_mfcc)
inp_len = tfk.Input(shape=(1,))      # prediction length per sample
seq_len = tfk.Input(shape=(1,))      # label length per sample
out = tfkl.Conv1D(filters=128, kernel_size=5, padding='same', activation='relu')(inp)
out = tfkl.BatchNormalization()(out)
out = tfkl.Bidirectional(tfkl.GRU(128, return_sequences=True, implementation=0))(out)
out = tfkl.Dropout(0.2)(out)
out = tfkl.BatchNormalization()(out)
out = tfkl.TimeDistributed(tfkl.Dense(27, activation='softmax'))(out)
cnn_model = tfk.models.Model(inputs=[inp, inp_len, seq_len], outputs=out)
cnn_model.compile(loss=ctc_loss(inp_lengths=inp_len, seq_lengths=seq_len), optimizer='Adam', metrics=['mae'])
Then you simply call,
cnn_model.fit([train_data, train_inp_lengths, train_seq_lengths], train_labels, batch_size=64, epochs=20)
which gave,
Train on 900 samples
Epoch 1/20
900/900 [==============================] - 3s 3ms/sample - loss: 11.4955 - mean_absolute_error: 0.0442
Epoch 2/20
900/900 [==============================] - 2s 2ms/sample - loss: 4.1317 - mean_absolute_error: 0.0340
...
Epoch 19/20
900/900 [==============================] - 2s 2ms/sample - loss: 0.1162 - mean_absolute_error: 0.0275
Epoch 20/20
900/900 [==============================] - 2s 2ms/sample - loss: 0.1012 - mean_absolute_error: 0.0277
import numpy as np

y = cnn_model.predict([test_data, test_inp_lengths, test_seq_lengths])

n_ids = 5
for pred, true in zip(y[:n_ids, :, :], test_labels[:n_ids, :, :]):
    pred_ids = np.argmax(pred, axis=-1)
    true_ids = np.argmax(true, axis=-1)
    print('pred > ', [rev_a_map[tid] for tid in pred_ids])
    print('true > ', [rev_a_map[tid] for tid in true_ids])
this gives,
pred > ['e', ' ', 'i', 'i', 'i', 'g', 'h', ' ', ' ', 't']
true > ['e', 'i', 'g', 'h', 't', ' ', ' ', ' ', ' ', ' ']
pred > ['o', ' ', ' ', 'n', 'e', ' ', ' ', ' ', ' ', ' ']
true > ['o', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
pred > ['s', 'e', ' ', ' ', ' ', ' ', ' ', ' ', 'v', 'e']
true > ['s', 'e', 'v', 'e', 'n', ' ', ' ', ' ', ' ', ' ']
pred > ['z', 'e', ' ', ' ', ' ', ' ', ' ', 'r', 'o', ' ']
true > ['z', 'e', 'r', 'o', ' ', ' ', ' ', ' ', ' ', ' ']
pred > ['n', ' ', ' ', 'i', 'i', 'n', 'e', ' ', ' ', ' ']
true > ['n', 'i', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ']
To get rid of repeating letters and the spaces in between, use the ctc_decode function as follows.
y = cnn_model.predict([test_data, test_inp_lengths, test_seq_lengths])

sess = K.get_session()   # TF 1.x-style session
pred = sess.run(tf.keras.backend.ctc_decode(y, test_inp_lengths[:, 0]))

rev_a_map[-1] = '-'      # ctc_decode marks blanks with -1
for pred, true in zip(pred[0][0][:n_ids, :], test_labels[:n_ids, :, :]):
    print(pred.shape)
    true_ids = np.argmax(true, axis=-1)
    print('pred > ', [rev_a_map[tid] for tid in pred])
    print('true > ', [rev_a_map[tid] for tid in true_ids])
which gave,
pred > ['e', 'i', 'g', 'h', 't']
true > ['e', 'i', 'g', 'h', 't', ' ', ' ', ' ', ' ', ' ']
pred > ['o', 'n', 'e', '-', '-']
true > ['o', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
pred > ['s', 'e', 'i', 'v', 'n']
true > ['s', 'e', 'v', 'e', 'n', ' ', ' ', ' ', ' ', ' ']
pred > ['z', 'e', 'r', 'o', '-']
true > ['z', 'e', 'r', 'o', ' ', ' ', ' ', ' ', ' ', ' ']
pred > ['n', 'i', 'n', 'e', '-']
true > ['n', 'i', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ']
Note the -1 entries in the decoded predictions. This is something added by the ctc_decode function to represent blanks.
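As a side note, K.get_session() only exists for TF 1.x-style graph execution; if you are running TF 2.x eagerly, a rough equivalent (assuming the same y and test_inp_lengths as above) would be:

import tensorflow as tf

# y: (samples, timesteps, num_classes) softmax output, test_inp_lengths: (samples, 1)
decoded, _ = tf.keras.backend.ctc_decode(y, test_inp_lengths[:, 0])
decoded_ids = decoded[0].numpy()     # greedy path; -1 marks CTC blanks
print(decoded_ids[:5])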