I have a dataset of audio files, and I have transformed these audios into MFCC plots like this one:
Now I want to feed my neural network:
import tensorflow as tf
import tensorflow.keras as tfk
import tensorflow.keras.layers as tfkl
cnn_model = tfk.Sequential(name='CNN_model')
cnn_model.add(tfkl.Conv1D(filters= 225, kernel_size= 11, padding='same', activation='relu', input_shape=(4500,9000, 3)))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.Bidirectional(tfkl.GRU(200, activation='relu', return_sequences=True, implementation=0)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())
cnn_model.compile(loss='mae', optimizer='Adam', metrics=['mae'])
cnn_model.summary()
I use a Conv1D because it is the layer typically used for this kind of network. But I don't know how to transform the data from the image into the input of the CNN. I have tried several transformations on my own, but they just didn't work.
As you can see in the picture below, I need to feed the first layer, which is a Conv1D, but I can't because the shape of my image is (4500, 9000, 3). So basically, what I want to do is transform this image into an input for a Conv1D, in the same way as in the image below. This image represents one audio file passed to the NN.
Obviously, when I pass an image with this shape to a Conv1D layer, I get a ValueError:
ValueError: Input 0 of layer conv1d_4 is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [None, 4500, 9000, 3]
I have tried converting my image to greyscale, but that is not the right approach and I lose valuable information.
I think you can convert the image into grayscale, but you risk losing a lot of valuable data.
The best possible approach is to reshape the MFCC spectrogram: img.reshape(4500, 3 * 9000)
Example
# Sample data
>>> a
array([[[1, 1, 1],
        [2, 2, 2]],

       [[3, 3, 3],
        [4, 4, 4]]])
>>> a.shape
(2, 2, 3)

# Reshaping data
>>> a.reshape(2, -1)
array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])

# Or
>>> a.reshape(2, 6)
array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])
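For concreteness, here is a minimal sketch of how that same reshape could be applied to one (4500, 9000, 3) plot before handing it to a Conv1D; the img array below is a random placeholder, not your actual MFCC image:

import numpy as np

# stand-in for one (height, width, channels) MFCC plot image
img = np.random.rand(4500, 9000, 3).astype('float32')

flat = img.reshape(4500, -1)     # fold the 3 channels into the feature axis -> (4500, 27000)
batch = flat[np.newaxis, ...]    # Conv1D expects (batch, steps, features) -> (1, 4500, 27000)
print(batch.shape)

With this layout the Conv1D input_shape would be (4500, 27000) instead of (4500, 9000, 3).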
I feel like you are not looking at this as a typical speech recognition problem, because I see several strange choices in your approach.
If you look at librosa.feature.mfcc, this is what it says:
Returns: M: np.ndarray [shape=(n_mfcc, t)]
So as you can see, there are no channels here. There is the input dimension (n_mfcc) and the time dimension (t). Therefore, you should be able to use a Conv1D directly without any preprocessing.
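As a rough sketch (with a synthetic signal standing in for one of your audio files), the only transformation needed is a transpose plus a batch dimension:

import numpy as np
import librosa

# a one-second synthetic signal stands in for a real recording
sr = 22050
y = np.random.randn(sr).astype(np.float32)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (n_mfcc, t)
x = mfcc.T[np.newaxis, ...]                          # shape (1, t, n_mfcc), ready for a Conv1D
print(x.shape)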
This is what the tail of your model looks like:
cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())
Personally, I haven't seen people using dropout on the last layer, so I would get rid of it: dropout randomly switches off neurons, but you want all your output nodes active at all times.
Usually, CTC is what is used to optimize speech recognition models. I (personally) haven't seen anybody using mae as a loss for a speech model, because your input data and label data usually have misaligned time dimensions. This means there isn't always a label corresponding to each time step of the prediction, and that's where CTC loss shines. That's probably what you want to use for this model (unless you are 100% certain that there is a label for each and every prediction and that they are perfectly aligned).
Having said that, the loss depends on the problem you're solving. But I will include an example of how to use this loss for this problem.
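To make the shape contract of CTC concrete, here is a minimal, self-contained sketch of tf.keras.backend.ctc_batch_cost with dummy numbers; the values are made up and only illustrate that predictions and labels can have different lengths:

import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K

batch, timesteps, n_classes = 2, 10, 28                                   # 27 symbols + 1 CTC blank
y_pred = tf.nn.softmax(tf.random.normal((batch, timesteps, n_classes)))   # fake model output
y_true = np.array([[4, 8, 6, 7, 19], [14, 13, 4, 0, 0]])                  # padded label ids
input_length = np.array([[10], [10]])                                     # prediction length per sample
label_length = np.array([[5], [3]])                                       # transcript length per sample

loss = K.ctc_batch_cost(y_true, y_pred, input_length, label_length)
print(loss.shape)                                                         # (2, 1): one loss value per sample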
To show a working example, I'm going to use this speech dataset. I chose it because I can get a good result quickly, due to the simplicity of the problem.
Then you can perform MFCC on the audio files, and you will get the following heatmap. As I said before, this will be a 2D matrix of size (n_mfcc, timesteps). With the batch dimension it becomes (batch size, n_mfcc, timesteps).
Here's how you can visualize the above. Here, y is an audio signal loaded via the librosa.core.load() function.
import librosa
import librosa.display
import matplotlib.pyplot as plt

# y is the raw waveform and sr the sampling rate of one audio file
y = audios[aid][1][0]
sr = audios[aid][1][1]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
print(mfcc.shape)

plt.figure(figsize=(6, 4))
librosa.display.specshow(mfcc, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()
plt.show()
Next you can create your training and testing data. Here's what I create:

train_data - A (sample size, timesteps, n_mfcc) size array
train_labels - A (sample size, timesteps, num_classes) size array
train_inp_lengths - A (sample size, ) size array (for CTC loss)
train_seq_lengths - A (sample size, ) size array (for CTC loss)

test_data - A (sample size, timesteps, n_mfcc) size array
test_labels - A (sample size, timesteps, num_classes+1) size array
test_inp_lengths - A (sample size, ) size array (for CTC loss)
test_seq_lengths - A (sample size, ) size array (for CTC loss)

I am using the following mapping to convert chars to numbers:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
a_map = {} # map letter to number
rev_a_map = {} # map number to letter
for i, a in enumerate(alphabet):
    a_map[a] = i
    rev_a_map[i] = a
label_map = {0:'zero', 1:'one', 2:'two', 3:'three', 4:'four', 5:'five', 6:'six', 7: 'seven', 8: 'eight', 9:'nine'}
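The answer doesn't show how the label and length arrays are built, so here is a rough, hypothetical sketch of one way to do it with the a_map / label_map defined above; max_label_len and the hard-coded 10 timesteps are assumptions, not values taken from the original code:

import numpy as np
import tensorflow as tf

max_label_len = 10                     # assumed padded transcript length

def encode_word(word, max_len=max_label_len):
    # map characters to ids and pad with spaces up to max_len
    return np.array([a_map[c] for c in word.ljust(max_len)])

words = [label_map[0], label_map[7]]                               # e.g. ['zero', 'seven']
label_ids = np.stack([encode_word(w) for w in words])              # (samples, max_label_len)
train_labels = tf.keras.utils.to_categorical(label_ids, num_classes=len(alphabet))
train_seq_lengths = np.array([len(w) for w in words])              # true transcript lengths
train_inp_lengths = np.full(len(words), 10)                        # timesteps produced by the model
print(train_labels.shape, train_seq_lengths.shape, train_inp_lengths.shape)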
A few things to note:

The mfcc operation returns (n_mfcc, time). You have to do an axis permutation to get it into the (time, n_mfcc) format, so that the convolution happens on the time dimension.
I have changed from the sequential API to the functional API, as I needed to include several input layers to make this work for ctc_loss. Furthermore, I got rid of that last dropout layer.
import tensorflow.keras.backend as K

def ctc_loss(inp_lengths, seq_lengths):
    # wrap Keras' ctc_batch_cost so the per-sample lengths are available inside the loss
    def loss(y_true, y_pred):
        l = tf.reduce_mean(K.ctc_batch_cost(tf.argmax(y_true, axis=-1), y_pred, inp_lengths, seq_lengths))
        return l
    return loss

K.clear_session()
inp = tfk.Input(shape=(10, 50))      # (timesteps, n_mfcc)
inp_len = tfk.Input(shape=(1,))      # prediction length per sample
seq_len = tfk.Input(shape=(1,))      # label length per sample
out = tfkl.Conv1D(filters=128, kernel_size=5, padding='same', activation='relu')(inp)
out = tfkl.BatchNormalization()(out)
out = tfkl.Bidirectional(tfkl.GRU(128, return_sequences=True, implementation=0))(out)
out = tfkl.Dropout(0.2)(out)
out = tfkl.BatchNormalization()(out)
out = tfkl.TimeDistributed(tfkl.Dense(27, activation='softmax'))(out)
cnn_model = tfk.models.Model(inputs=[inp, inp_len, seq_len], outputs=out)
cnn_model.compile(loss=ctc_loss(inp_lengths=inp_len, seq_lengths=seq_len), optimizer='Adam', metrics=['mae'])
Then you simply call,
cnn_model.fit([train_data, train_inp_lengths, train_seq_lengths], train_labels, batch_size=64, epochs=20)
which gave,
Train on 900 samples
Epoch 1/20
900/900 [==============================] - 3s 3ms/sample - loss: 11.4955 - mean_absolute_error: 0.0442
Epoch 2/20
900/900 [==============================] - 2s 2ms/sample - loss: 4.1317 - mean_absolute_error: 0.0340
...
Epoch 19/20
900/900 [==============================] - 2s 2ms/sample - loss: 0.1162 - mean_absolute_error: 0.0275
Epoch 20/20
900/900 [==============================] - 2s 2ms/sample - loss: 0.1012 - mean_absolute_error: 0.0277
import numpy as np

y = cnn_model.predict([test_data, test_inp_lengths, test_seq_lengths])

n_ids = 5
for pred, true in zip(y[:n_ids, :, :], test_labels[:n_ids, :, :]):
    pred_ids = np.argmax(pred, axis=-1)
    true_ids = np.argmax(true, axis=-1)
    print('pred > ', [rev_a_map[tid] for tid in pred_ids])
    print('true > ', [rev_a_map[tid] for tid in true_ids])
this gives,
pred > ['e', ' ', 'i', 'i', 'i', 'g', 'h', ' ', ' ', 't']
true > ['e', 'i', 'g', 'h', 't', ' ', ' ', ' ', ' ', ' ']
pred > ['o', ' ', ' ', 'n', 'e', ' ', ' ', ' ', ' ', ' ']
true > ['o', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
pred > ['s', 'e', ' ', ' ', ' ', ' ', ' ', ' ', 'v', 'e']
true > ['s', 'e', 'v', 'e', 'n', ' ', ' ', ' ', ' ', ' ']
pred > ['z', 'e', ' ', ' ', ' ', ' ', ' ', 'r', 'o', ' ']
true > ['z', 'e', 'r', 'o', ' ', ' ', ' ', ' ', ' ', ' ']
pred > ['n', ' ', ' ', 'i', 'i', 'n', 'e', ' ', ' ', ' ']
true > ['n', 'i', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ']
To get rid of repeating letters and the spaces in between, use the ctc_decode function as follows.
y = cnn_model.predict([test_data, test_inp_lengths, test_seq_lengths])

sess = K.get_session()   # TF 1.x-style session
pred = sess.run(tf.keras.backend.ctc_decode(y, test_inp_lengths[:, 0]))

rev_a_map[-1] = '-'      # ctc_decode marks blanks with -1
for pred, true in zip(pred[0][0][:n_ids, :], test_labels[:n_ids, :, :]):
    print(pred.shape)
    true_ids = np.argmax(true, axis=-1)
    print('pred > ', [rev_a_map[tid] for tid in pred])
    print('true > ', [rev_a_map[tid] for tid in true_ids])
which gave,
pred > ['e', 'i', 'g', 'h', 't']
true > ['e', 'i', 'g', 'h', 't', ' ', ' ', ' ', ' ', ' ']
pred > ['o', 'n', 'e', '-', '-']
true > ['o', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
pred > ['s', 'e', 'i', 'v', 'n']
true > ['s', 'e', 'v', 'e', 'n', ' ', ' ', ' ', ' ', ' ']
pred > ['z', 'e', 'r', 'o', '-']
true > ['z', 'e', 'r', 'o', ' ', ' ', ' ', ' ', ' ', ' ']
pred > ['n', 'i', 'n', 'e', '-']
true > ['n', 'i', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ']
Note the -1 entries in the decoded predictions. This is something added by the ctc_decode function to represent blanks.
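As a side note, K.get_session() only exists for TF 1.x-style graph execution; if you are running TF 2.x eagerly, a rough equivalent (assuming the same y and test_inp_lengths as above) would be:

import tensorflow as tf

# y: (samples, timesteps, num_classes) softmax output, test_inp_lengths: (samples, 1)
decoded, _ = tf.keras.backend.ctc_decode(y, test_inp_lengths[:, 0])
decoded_ids = decoded[0].numpy()     # greedy path; -1 marks CTC blanks
print(decoded_ids[:5])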