
Understanding the output of mfcc

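from librosa import load
from librosa.feature import mfcc

def extract_mfcc(sound):
    # load() returns the samples and the sampling rate (22050 Hz by default);
    # here the sampling rate is stored in `frame`
    data, frame = load(sound)
    # recent librosa versions expect keyword arguments here
    return mfcc(y=data, sr=frame)


# use a different name so the imported mfcc function is not shadowed
mfccs = extract_mfcc("sound.wav")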

I would like to get the MFCC of the following sound.wav file which is 48 seconds long.

I understand that the data * frame = length of audio.

But when I compute the MFCC as shown above and get its shape, this is the result: (20, 2086)

What do those numbers represent? How can I calculate the time of the audio just by its MFCC?

I'm trying to calculate the average MFCC per ms of audio.

Any help is appreciated! Thank you :)

Eduardo Morales asked Sep 08 '18 06:09





1 Answer

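The shape is (number of coefficients, number of frames): 20 coefficients (librosa's default n_mfcc) for each of 2086 frames. There are many frames because mel-frequency cepstral coefficients are computed over a window, i.e. a number of samples. Sound is a wave, and no features can be derived from a single sample (one number), hence the window.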
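
To make the windowing idea concrete, here is a minimal sketch in plain Python. It does not use librosa; the helper name `frame_signal` and the tiny sizes are assumptions chosen for illustration, and no padding is applied.

```python
def frame_signal(samples, frame_length, hop_length):
    """Split a sequence of samples into overlapping windows (no padding)."""
    frames = []
    start = 0
    # Keep taking windows while a full window still fits in the signal.
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += hop_length  # successive windows overlap when hop < length
    return frames


samples = list(range(10))  # stand-in for audio samples
frames = frame_signal(samples, frame_length=4, hop_length=2)
for frame in frames:
    print(frame)
```

Each window would yield one column of coefficients, so the number of windows determines the second dimension of the MFCC array.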

To compute MFCCs, a fast Fourier transform (FFT) is used, and it requires a window length. If you check the librosa documentation for mfcc, you won't find this as an explicit parameter. That's because it's implicit; the defaults are:

  • length of the FFT window (n_fft): 2048
  • number of samples between successive frames (hop_length): 512

They are passed as **kwargs to librosa.feature.melspectrogram, which defines them.

If you now take into account the sampling frequency of your audio and these numbers, you will arrive at the result you reported.

Since the default sampling rate for librosa is 22050 Hz, the audio length is 48 s and the hop between frames is 512 samples, here's what follows:

22050 * 48 / 512 ≈ 2067 frames
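
The same estimate as a quick calculation, written as a minimal sketch in plain Python. The helper name `approx_frame_count` is hypothetical, and its defaults are simply the librosa values quoted above; it ignores any padding at the edges.

```python
def approx_frame_count(duration_s, sr=22050, hop_length=512):
    """Rough number of frames: total samples divided by the hop length."""
    n_samples = duration_s * sr
    return n_samples / hop_length


estimate = approx_frame_count(48.0)
print(estimate)
```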

The number is not exactly 2086, as:

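  • Your audio length isn't exactly 48 seconds (2086 frames correspond to roughly 48.4 seconds).
  • The actual window length is 2048, with a hop of 512, and by default librosa pads the signal so that frames are centered (center=True). The frame count is then 1 + floor(n_samples / 512) rather than exactly n_samples / 512.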
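
Going the other way, the duration can be estimated from the MFCC shape alone, as long as the sampling rate and hop length are known. A minimal sketch in plain Python, assuming the librosa defaults mentioned above; the helper names `frames_to_seconds` and `ms_per_frame` are hypothetical, and the result is approximate because edge padding is ignored.

```python
def frames_to_seconds(n_frames, sr=22050, hop_length=512):
    """Approximate duration covered by n_frames MFCC columns."""
    return n_frames * hop_length / sr


def ms_per_frame(sr=22050, hop_length=512):
    """Time step between two successive MFCC columns, in milliseconds."""
    return 1000.0 * hop_length / sr


n_coefficients, n_frames = (20, 2086)  # the shape reported in the question
duration = frames_to_seconds(n_frames)
step_ms = ms_per_frame()
print(f"about {duration:.2f} s, one column every {step_ms:.2f} ms")
```

Each column therefore summarizes roughly 23 ms of audio, so an average over a time span amounts to averaging the columns that fall inside that span. librosa also provides frames_to_time for converting frame indices to seconds.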
Lukasz Tracewski answered Sep 21 '22 23:09
