How to split speech data on frames and compute MFCC

I understand the basic steps of creating an automated speech recognition engine. However, I need a clearer idea of how segmentation is done and what frames and samples are. I will write down what I know; please correct me where I'm wrong and guide me further.

The basic steps of Speech Recognition as I know it are:

(I'm assuming the input data is a wav/ogg (or some kind of audio) file)

1. Pre-emphasize the speech signal, i.e., apply a filter that emphasizes the high-frequency components. Possibly something like: y[n] = x[n] - 0.95 x[n-1]
2. Find the time at which the utterances start and resize the clip. (Interchangeable with Step 1)
3. Segment the clip into smaller time frames, each segment being about 30 msec long. Further, each segment will have about 256 frames, and two segments will have a separation of 100 frames? (i.e., 30*100/256 msec?)
4. Apply a Hamming window to each frame (1/256th of a segment)? The result is an array of frames of the signal.
5. Apply the Fast Fourier Transform to the signal of each frame, represented by X(t).
6. Mel filter bank processing (I have not yet gone into detail).
7. Discrete Cosine Transform (I have not yet gone into detail, but I know that this will give me a set of MFCCs, also called acoustic vectors, for each input utterance).
8. Delta energy and delta spectrum: I know that this is used to calculate the delta and double-delta coefficients of the MFCCs, but not much more.
9. After this, I think I need to use HMMs or ANNs to classify the Mel Frequency Cepstrum Coefficients (delta and double delta) into the corresponding phonemes, and then perform analysis to match the phonemes to words and the words to sentences.

Although these steps are mostly clear to me, I am not sure whether step 3 is correct. If it is correct, do I apply the steps following 3 to each frame? Also, after step 6, I think that each frame has its own set of MFCCs. Am I right?

Thank you in advance!

asked Jan 08 '16 08:01

cipher

1 Answer

> Segment the clip into smaller time frames, each segment being like 30msecs long. Further, each segment will have about 256 frames and two segments will have a separation of 100 frames? (i.e., 30*100/256 msec?)

Not frames, but samples. At an 8 kHz sample rate, each 30 ms frame contains 30/1000 * 8000 = 240 samples. Frames overlap: the shift between the starts of consecutive frames is 10 ms, or 80 samples. Here is how it looks in a picture:

[Figure: signal split into frames]

Here Q, the frame shift, is 80 samples and K, the frame length, is 240 samples.
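The framing arithmetic can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper name split_into_frames and the stand-in signal (one second of dummy sample values) are hypothetical, and trailing samples that do not fill a whole frame are simply dropped.

```python
def split_into_frames(samples, frame_len, frame_shift):
    """Cut a list of samples into overlapping frames.

    Consecutive frames start frame_shift samples apart; trailing samples
    that do not fill a complete frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


sample_rate = 8000                          # Hz
frame_len = 30 * sample_rate // 1000        # 30 ms -> 240 samples
frame_shift = 10 * sample_rate // 1000      # 10 ms -> 80 samples

# Stand-in for one second of audio: the sample index is used as the value.
signal = list(range(sample_rate))

frames = split_into_frames(signal, frame_len, frame_shift)
print(len(frames), len(frames[0]))          # number of frames, samples per frame
```

With one second of audio at 8 kHz this gives 98 frames of 240 samples each, and every frame shares 160 samples with the frame before it.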

> If it is correct, in the steps following 3, do I apply that to each frame?

Yes. Windowing, the FFT, the mel filter bank and the DCT are each applied to every frame separately.
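As a rough sketch of what "apply it to each frame" means for the windowing and Fourier steps, the following plain-Python example windows every frame and computes its power spectrum. It is an illustration under assumptions, not a reference implementation: the function names are hypothetical, the test signal is a synthetic 1000 Hz sine, and a naive O(N^2) DFT stands in for a real FFT routine to keep the example dependency-free.

```python
import cmath
import math


def hamming(n_points):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def power_spectrum(frame):
    """Window one frame and return |DFT|^2 for bins 0..N/2.

    Uses a naive DFT for clarity; a real system would call an FFT.
    """
    n_points = len(frame)
    windowed = [x * w for x, w in zip(frame, hamming(n_points))]
    spectrum = []
    for k in range(n_points // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


sample_rate = 8000
frame_len, frame_shift = 240, 80

# Synthetic test signal: 400 samples of a 1000 Hz sine wave.
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(400)]

# Split into overlapping frames, then process each frame on its own.
frames = [signal[s:s + frame_len]
          for s in range(0, len(signal) - frame_len + 1, frame_shift)]
spectra = [power_spectrum(frame) for frame in frames]

# Bin k corresponds to k * sample_rate / frame_len Hz.
peak_bins = [spec.index(max(spec)) for spec in spectra]
print(len(spectra), peak_bins)
```

Each frame produces its own spectrum with 121 bins, and for this test tone the strongest bin in every frame is bin 30, which corresponds to 30 * 8000 / 240 = 1000 Hz.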

> Also, after step 6, I think that each frame has their own set of MFCC, am I right.

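Yes. Once the DCT has been applied, each frame has its own vector of MFCCs, so an utterance becomes a sequence of vectors, one per frame.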
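To make the "one MFCC vector per frame" point concrete, here is a minimal plain-Python sketch of the mel filter bank and DCT steps. Several things in it are assumptions chosen for illustration rather than anything fixed by the question: the function names, the common 2595 * log10(1 + f/700) mel formula, 20 triangular filters, 13 coefficients, an unnormalised DCT-II, and the two made-up power spectra used as input.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters evenly spaced on the mel scale from 0 Hz to Nyquist."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # n_filters + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        bank.append([max(0.0, min((f - left) / (center - left),
                                  (right - f) / (right - center)))
                     for f in bin_hz])
    return bank


def dct2(values, n_coeffs):
    """Unnormalised DCT-II, keeping only the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]


def mfcc(power_spectrum, bank, n_coeffs=13):
    """Log mel filter bank energies followed by a DCT, for one frame."""
    log_energies = [
        math.log(max(sum(w * p for w, p in zip(filt, power_spectrum)), 1e-10))
        for filt in bank
    ]
    return dct2(log_energies, n_coeffs)


n_bins = 121                                  # 240-sample frame -> bins 0..120
bank = mel_filterbank(20, n_bins, 8000)

# Made-up power spectra for two frames: one flat, one tilted towards low bins.
spectra = [[1.0] * n_bins, [1.0 / (1 + k) for k in range(n_bins)]]

mfcc_vectors = [mfcc(spec, bank) for spec in spectra]
print(len(mfcc_vectors), len(mfcc_vectors[0]))  # frames, coefficients per frame
```

Two input frames give two 13-dimensional vectors. Delta and double-delta coefficients would then be computed across neighbouring frames of this sequence.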

answered Oct 19 '22 04:10

Nikolay Shmyrev


Tags: speech-recognition, speech, speech-to-text, cmusphinx
