 

Why do Mel-filterbank energies outperform MFCCs for speech commands recognition using CNN?

Last month, a user called @jojek gave me the following advice in a comment:

I can bet that given enough data, CNN on Mel energies will outperform MFCCs. You should try it. It makes more sense to do convolution on Mel spectrogram rather than on decorrelated coefficients.

Yes, I tried CNN on Mel-filterbank energies, and it outperformed MFCCs, but I still don't know the reason!

However, many tutorials, like this one by TensorFlow, encourage the use of MFCCs for such applications:

Because the human ear is more sensitive to some frequencies than others, it's been traditional in speech recognition to do further processing to this representation to turn it into a set of Mel-Frequency Cepstral Coefficients, or MFCCs for short.

Also, I want to know whether Mel-filterbank energies outperform MFCCs only with CNNs, or whether this also holds for LSTMs, DNNs, etc., and I would appreciate a reference.


Update 1:

My comment on @Nikolay's answer contains relevant details, so I will add it here:

Correct me if I'm wrong: since applying the DCT to the log Mel-filterbank energies is, in this case, equivalent to an IDFT, it seems to me that keeping cepstral coefficients 2-13 (inclusive) and discarding the rest is equivalent to low-time liftering, which isolates the vocal-tract components and drops the source components (which contain, e.g., the F0 spike).

So why should I use all 40 MFCCs, when all I care about for the speech command recognition model is the vocal-tract components?
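The liftering argument above can be sketched numerically. In this sketch the 40-band log-mel frame is synthetic random data and the 13-coefficient cutoff is just the conventional choice: inverting a truncated cepstrum gives a smoothed spectral envelope, while inverting the full set of coefficients reconstructs the log-mel energies exactly.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)

# Hypothetical 40-band log mel-filterbank frame (stand-in for real features).
log_mel = rng.standard_normal(40).cumsum()

# MFCCs are the DCT-II of the log mel energies.
mfcc = dct(log_mel, type=2, norm='ortho')

# Low-time liftering: keep only the first 13 cepstral coefficients.
liftered = np.zeros_like(mfcc)
liftered[:13] = mfcc[:13]

# Inverting the truncated cepstrum gives a smoothed spectral envelope:
# the fast variations across bands (the "source" part) are gone.
envelope = idct(liftered, type=2, norm='ortho')

# Inverting the full set of 40 coefficients reconstructs the frame exactly.
full = idct(mfcc, type=2, norm='ortho')
assert np.allclose(full, log_mel)
```

So truncation is exactly where information is discarded; the transform itself loses nothing.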

Update 2:

Another point of view (link) is:

Notice that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filterbank energies and it turns out that these fast changes actually degrade ASR performance, so we get a small improvement by dropping them.

References:

https://tspace.library.utoronto.ca/bitstream/1807/44123/1/Mohamed_Abdel-rahman_201406_PhD_thesis.pdf

Asked Feb 27 '20 by Abdulkader


1 Answer

The thing is that MFCCs are computed from the mel energies by a simple matrix multiplication (the DCT) followed by a reduction of dimension. The matrix multiplication by itself changes nothing, since the neural network applies many further transformations afterwards.
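This first point can be checked directly. With orthonormal scaling, the DCT used for MFCCs is multiplication by an orthogonal 40x40 matrix, so a full set of 40 MFCCs is a lossless linear recoding of the 40 log mel energies (a sketch with synthetic data):

```python
import numpy as np
from scipy.fft import dct

# Build the 40x40 DCT-II matrix by transforming the identity columns.
n = 40
D = dct(np.eye(n), type=2, norm='ortho', axis=0)

# With orthonormal scaling, the transform matrix is orthogonal ...
assert np.allclose(D.T @ D, np.eye(n))

# ... so mfcc = D @ log_mel is invertible: log_mel = D.T @ mfcc.
log_mel = np.random.default_rng(1).standard_normal(n)
mfcc = D @ log_mel
assert np.allclose(D.T @ mfcc, log_mel)
```

A network's first dense or convolutional layer can absorb (or undo) any such fixed linear map, which is why the transform itself is harmless.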

What matters is the reduction of dimension: instead of 40 mel energies you keep only 13 cepstral coefficients and drop the rest. That reduces accuracy with a CNN, a DNN, or any other architecture.

However, if you don't drop coefficients and keep all 40 MFCCs, you can get the same accuracy as with mel energies, or even better.

So it doesn't matter whether you use mel energies or MFCCs; what matters is how many coefficients you keep in your features.

Answered Oct 06 '22 by Nikolay Shmyrev