I am currently working on a project for my university. The task is to write a speech recognition system that runs on a phone in the background, waiting for a few commands (e.g. "call 0 123 ...").
It's a two-month project, so it does not have to be very accurate. Only a small amount of noise needs to be tolerated, and words will be separated by moments of silence.
I am currently at the point of loading a sample word encoded in raw 16-bit PCM format, splitting it into chunks (about 50 per second), and running an FFT on each chunk to get its frequency spectrum.
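For reference, a minimal sketch of that stage in plain Python. It assumes little-endian, mono samples, and it zero-pads each chunk to the next power of two so that a simple radix-2 FFT can be used. The function names (decode_pcm16, split_into_chunks, magnitude_spectrum) are made up for illustration and are not from any library.

```python
import cmath
import struct


def decode_pcm16(raw_bytes):
    """Decode raw 16-bit PCM (assumed little-endian, mono) into a list of ints."""
    count = len(raw_bytes) // 2
    return list(struct.unpack("<%dh" % count, raw_bytes[:count * 2]))


def split_into_chunks(samples, sample_rate, chunks_per_second=50):
    """Cut the signal into non-overlapping chunks; a trailing partial chunk is dropped."""
    size = sample_rate // chunks_per_second
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(chunk):
    """Zero-pad the chunk to a power of two and return magnitudes for bins 0..N/2."""
    size = 1
    while size < len(chunk):
        size *= 2
    padded = list(chunk) + [0] * (size - len(chunk))
    return [abs(c) for c in fft(padded)[:size // 2 + 1]]
```

With a 16 kHz recording this gives 320-sample chunks padded to 512 points, so bin k corresponds to k * 16000 / 512 Hz.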
Things to solve are: 1) going through a longer recording and splitting it into words; 2) finding the best match for each word.
1) I was thinking about just checking chunk after chunk, and if I encounter a few consecutive chunks with higher amplitudes in the human voice frequencies, assuming that a word has started. Anyway, I am looking for resources that may help with this.
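A rough sketch of that idea, under some assumptions: each chunk is reduced to a single energy value (here plain time-domain energy, which is a simplification; the summed spectrum magnitudes in the voice band could be used instead), a word starts after a few consecutive chunks above a threshold, and it ends after a few consecutive chunks below it. The names and the default run lengths are hypothetical and would need tuning on real recordings.

```python
def chunk_energy(chunk):
    """Mean squared amplitude of one chunk (0.0 for an empty chunk)."""
    return sum(s * s for s in chunk) / len(chunk) if chunk else 0.0


def find_words(energies, threshold, min_voiced=3, min_silent=5):
    """Return (start, end) chunk index pairs for detected words, end exclusive.

    A word starts once min_voiced consecutive chunks exceed the threshold
    and ends once min_silent consecutive chunks fall at or below it.
    """
    words = []
    start = None
    voiced_run = 0
    silent_run = 0
    for i, energy in enumerate(energies):
        loud = energy > threshold
        if start is None:
            voiced_run = voiced_run + 1 if loud else 0
            if voiced_run >= min_voiced:
                start = i - min_voiced + 1
                silent_run = 0
        else:
            silent_run = 0 if loud else silent_run + 1
            if silent_run >= min_silent:
                words.append((start, i - min_silent + 1))
                start = None
                voiced_run = 0
    if start is not None:
        # The recording ended inside a word; trim any trailing quiet chunks.
        words.append((start, len(energies) - silent_run))
    return words
```

Requiring a run of loud chunks keeps short clicks from being treated as words, and requiring a run of quiet chunks keeps brief pauses inside a word from splitting it.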
2) This one seems a little bit tougher. Is it necessary to use HMMs for a system like this, or are there simpler methods, given that the vocabulary is so small (20 words)?
Edit: The point of the project is to write the system on my own, so I cannot use ready-made libraries like Sphinx or HTK.
Regards, Karol
If anybody has the same question in the future, look for two main keywords:
MFCC - Mel-frequency cepstral coefficients, used to calculate a series of coefficients for each word template.
DTW - Dynamic time warping, used to match a captured word against the templates. A good enough description of DTW can be found on Wikipedia.
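To make the MFCC keyword a bit more concrete, here is a minimal sketch of one common recipe for a single frame: Hamming window, power spectrum, triangular mel filterbank, log, then DCT-II. It is not necessarily what the project used. The filter count, the number of coefficients, the choice to skip coefficient 0, and all function names are assumptions for illustration. A naive DFT is used to keep it short, so it is slow.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame and return the power in bins 0..N/2 (naive DFT)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters with centers evenly spaced on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mels = [low + (high - low) * i / (num_filters + 1)
            for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=12):
    """Return coefficients 1..num_coeffs for one frame (coefficient 0 is skipped)."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # Floor the filter energies so that log() never sees zero.
    log_energies = [math.log(max(sum(w * p for w, p in zip(filt, spectrum)), 1e-10))
                    for filt in bank]
    return [sum(e * math.cos(math.pi * i * (j + 0.5) / num_filters)
                for j, e in enumerate(log_energies))
            for i in range(1, num_coeffs + 1)]
```

Running mfcc on every chunk of a word yields a sequence of coefficient vectors, which can serve as that word's template.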
This approach was good enough to reach around 80% accuracy on a 20-word dictionary and to give a good demo during the class.
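The DTW matching step can be sketched as follows. It assumes every word is represented as a sequence of per-chunk feature vectors (for example MFCCs), uses Euclidean distance between vectors, and applies the basic unconstrained recurrence. The names dtw_distance and best_match and the templates dictionary are hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Cost of the cheapest alignment between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a frame repeats
                                 cost[i][j - 1],      # seq_b frame repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def best_match(features, templates):
    """Return the template word whose sequence has the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))
```

For example, with templates = {"call": [[0.0], [1.0], [2.0], [3.0]], "stop": [[5.0], [5.0], [5.0]]}, a slower utterance such as [[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]] still aligns with "call" at zero cost. The raw distance grows with sequence length, so dividing it by the combined length of the two sequences may be worth trying when templates differ a lot in duration.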