
How to split speech data into frames and compute MFCCs

I understand the basic steps of creating an automated speech recognition engine. However, I need a clearer idea of how segmentation is done and what frames and samples are. I will write down what I know and hope that whoever answers will correct me where I'm wrong and guide me further.

The basic steps of Speech Recognition as I know it are:

(I'm assuming the input data is a wav/ogg (or some kind of audio) file)

  1. Pre-emphasize the speech signal, i.e., apply a filter that emphasizes the high-frequency components. Possibly something like: y[n] = x[n] - 0.95 x[n-1]
  2. Find the time at which the utterance starts and trim the clip. (Interchangeable with step 1.)
  3. Segment the clip into smaller time frames, each segment being about 30 ms long. Further, each segment will have about 256 frames, and two segments will have a separation of 100 frames? (i.e., 30*100/256 ms?)
  4. Apply a Hamming window to each frame (1/256th of a segment)? The result is an array of frames of the signal.
  5. Apply the Fast Fourier Transform to the signal of each frame, represented by X(t).
  6. Mel filter bank processing (haven't gone into detail yet).
  7. Discrete Cosine Transform (haven't gone into detail yet, but I know that this will give me a set of MFCCs, also called acoustic vectors, for each input utterance).
  8. Delta energy and delta spectrum: I know that this is used to calculate the delta and double-delta coefficients of the MFCCs, but not much more.
  9. After this, I think I need to use HMMs or ANNs to classify the Mel Frequency Cepstral Coefficients (with delta and double delta) into the corresponding phonemes, and then perform analysis to match the phonemes to words and the words to sentences.

Although these steps are clear to me, I am not sure whether step 3 is correct. If it is correct, in the steps following 3, do I apply that to each frame? Also, after step 6, I think that each frame has its own set of MFCCs, am I right?

Thank you in advance!

asked Jan 08 '16 08:01 by cipher



People also ask

How MFCC features are extracted for the speech recognition?

The MFCC feature extraction technique basically consists of windowing the signal, applying the DFT, warping the spectrum onto the Mel scale with a filter bank, taking the log of the filter-bank energies, and then applying the DCT.

What is framing in MFCC?

Framing is the process of dividing the speech signal into short frames, typically in the range of 5 to 50 milliseconds. The next step, windowing, applies a window function to each frame to reduce discontinuities and spectral leakage at the start and end of the frame [1]. MFCC features are calculated for each frame.

What is the output of MFCC feature extraction?

The output of MFCC extraction is a matrix of feature vectors extracted from all the frames. In this output matrix, the rows correspond to frame numbers and the columns to feature vector coefficients [1-4]. Finally, this output matrix is used for the classification process.


1 Answer

Segment the clip into smaller time frames, each segment being about 30 ms long. Further, each segment will have about 256 frames, and two segments will have a separation of 100 frames? (i.e., 30*100/256 ms?)

Not frames, but samples. Each 30 ms frame at an 8 kHz sample rate contains 30/1000 * 8000 = 240 samples. Frames overlap, and the shift between consecutive frames is 10 ms, or 80 samples. Here is how it looks in a picture:

[Figure: signal split into frames]

Here Q is the frame shift (80 samples) and K is the frame length (240 samples).
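To make the arithmetic concrete, here is a minimal NumPy sketch of the framing step, assuming an 8 kHz signal, a 240-sample frame, and an 80-sample shift as above. The function name `split_into_frames` is made up for illustration, and dropping the incomplete tail is just one possible choice (zero-padding it is equally common).

```python
import numpy as np

def split_into_frames(signal, frame_len=240, frame_shift=80):
    """Split a 1-D signal into overlapping frames, one frame per row.

    frame_len=240 and frame_shift=80 correspond to 30 ms and 10 ms at 8 kHz.
    Trailing samples that do not fill a whole frame are dropped
    (an assumption of this sketch).
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    # Row i holds samples [i * frame_shift, i * frame_shift + frame_len)
    starts = frame_shift * np.arange(num_frames)[:, None]
    offsets = np.arange(frame_len)[None, :]
    return signal[starts + offsets]

sample_rate = 8000
one_second = np.arange(sample_rate, dtype=float)  # stand-in for real audio
frames = split_into_frames(one_second)
print(frames.shape)  # (98, 240)
```

So one second of audio gives 98 frames of 240 samples each, and consecutive frames share 160 samples.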

If it is correct, in the steps following 3, do I apply that to each frame?

Yes
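As an illustration of "applied to each frame", the sketch below windows every frame with the same Hamming window and takes an FFT per frame. The 1 kHz test tone, the 256-point FFT size, and the variable names are assumptions chosen for the example, not something from the question.

```python
import numpy as np

sample_rate, frame_len, frame_shift, n_fft = 8000, 240, 80, 256

# Placeholder input: one second of a 1 kHz tone, split into 30 ms frames
# with a 10 ms shift.
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 1000 * t)
num_frames = 1 + (len(signal) - frame_len) // frame_shift
frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                   for i in range(num_frames)])

window = np.hamming(frame_len)      # one window, reused for every frame
windowed = frames * window          # broadcasts over the frame axis
# Each 240-sample frame is zero-padded to 256 points before the FFT.
spectrum = np.fft.rfft(windowed, n=n_fft, axis=1)
power = (np.abs(spectrum) ** 2) / n_fft   # power spectrum, one row per frame

print(power.shape)  # (98, 129)
peak_hz = power.argmax(axis=1) * sample_rate / n_fft  # strongest bin per frame
```

Every row of `power` is the spectrum of one frame, so all later steps (Mel filter bank, log, DCT) also operate row by row.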

Also, after step 6, I think that each frame has its own set of MFCCs, am I right?

Yes.
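A rough sketch of how one MFCC vector per frame comes out, assuming power spectra like those produced by a 256-point FFT at 8 kHz. The helper names (`mel_filterbank`, `dct_ii`), the 20 filters, the 13 coefficients, and the random placeholder spectra are all choices made for this illustration; real toolkits differ in details such as normalisation and liftering.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                            num_filters + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges)
                         / sample_rate).astype(int)
    bank = np.zeros((num_filters, n_fft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, centre, right = bin_edges[m - 1], bin_edges[m], bin_edges[m + 1]
        for k in range(left, centre):      # rising slope
            bank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling slope
            bank[m - 1, k] = (right - k) / (right - centre)
    return bank

def dct_ii(x, num_coeffs):
    """Unnormalised DCT-II along the last axis, keeping the first num_coeffs."""
    n = x.shape[-1]
    k = np.arange(num_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T

sample_rate, n_fft, num_filters, num_ceps = 8000, 256, 20, 13
rng = np.random.default_rng(0)
power = rng.random((98, n_fft // 2 + 1))   # placeholder: one spectrum per frame

bank = mel_filterbank(num_filters, n_fft, sample_rate)
mel_energies = power @ bank.T                  # (98, 20) filter-bank energies
log_energies = np.log(mel_energies + 1e-10)    # small floor avoids log(0)
mfcc = dct_ii(log_energies, num_ceps)          # (98, 13): one row per frame
print(mfcc.shape)  # (98, 13)
```

The result is a matrix with one row per frame and one column per cepstral coefficient, which is what a classifier such as an HMM would consume.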

answered Oct 19 '22 04:10 by Nikolay Shmyrev
