This is for a personal project of mine, and I have no idea where to start, as it falls way outside my comfort zone.
I know there is some language-learning software out there that lets the user record his or her voice and compare the pronunciation with that of a native speaker of the language.
My question is: how can this be achieved?
I mean, how does one compare the pronunciation of the user with that of the native speaker?
If you're looking for something relatively simple, you could compute the MFCCs (http://en.wikipedia.org/wiki/Mel-frequency_cepstrum) of the recording and then look at something straightforward like the correlation between the recording's coefficients and the average coefficients of that word as pronounced by a native speaker. The MFCC transforms the audio into a space where Euclidean distance corresponds more closely to perceptual difference.
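A minimal sketch of that idea in Python, assuming the librosa library is available and that "user.wav" and "native.wav" are placeholder recordings of the learner and the native speaker (in practice you would average the native coefficients over several speakers):

```python
import numpy as np
import librosa

def mfcc_features(path, n_mfcc=13):
    """Load a recording and return its MFCC matrix (n_mfcc x frames)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

# "user.wav" and "native.wav" are placeholder file names.
user = mfcc_features("user.wav")
native = mfcc_features("native.wav")

# Truncate to the shorter recording so the frames line up
# (a crude stand-in for the proper alignment discussed below).
frames = min(user.shape[1], native.shape[1])
user, native = user[:, :frames], native[:, :frames]

# Pearson correlation between the flattened coefficient matrices:
# closer to 1.0 means the two pronunciations are more similar.
score = np.corrcoef(user.ravel(), native.ravel())[0, 1]
print(f"similarity score: {score:.3f}")
```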
Of course, there are several possible problems:
Aligning the two recordings so the coefficients match up. To fix this, you could look at the maximum cross-correlation of the coefficients rather than the simple correlation, so you get an automatic "best alignment" for free (see the sketch below). You may also have to clip off the ends of the recordings so that only the actual pronunciation of the word remains.
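For example, a sketch of that maximum-cross-correlation alignment, using librosa's silence trimming to clip the ends (the lag range and minimum overlap below are arbitrary placeholders):

```python
import numpy as np
import librosa

def trimmed_mfcc(path, top_db=30, n_mfcc=13):
    """Load a recording, clip leading/trailing silence, and return its MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    y, _ = librosa.effects.trim(y, top_db=top_db)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def best_alignment_score(a, b, max_lag=50):
    """Shift b relative to a by up to max_lag frames and keep the best correlation."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        a_seg = a[:, max(lag, 0):]
        b_seg = b[:, max(-lag, 0):]
        n = min(a_seg.shape[1], b_seg.shape[1])
        if n < 10:  # not enough overlapping frames to compare
            continue
        c = np.corrcoef(a_seg[:, :n].ravel(), b_seg[:, :n].ravel())[0, 1]
        best = max(best, c)
    return best

# score = best_alignment_score(trimmed_mfcc("user.wav"), trimmed_mfcc("native.wav"))
```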
The MFCC maps to perceptual space, but it might not correspond so well to accent inaccuracies. You could perhaps address this by comparing the recording not only to the "ideal" pronunciation but also to averages for several common types of mispronunciation, and seeing which model it is closest to.
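A sketch of that "which model is it closest to" comparison, assuming you have already built averaged MFCC templates (the labels below are hypothetical) that are aligned and cropped to the same shape as the user's recording:

```python
import numpy as np

def nearest_model(user_mfcc, templates):
    """templates: dict mapping a label (e.g. "ideal", "vowel_too_short") to an
    averaged MFCC matrix with the same shape as user_mfcc.
    Returns the label of the closest template and all the distances."""
    distances = {
        label: np.linalg.norm(user_mfcc - template)
        for label, template in templates.items()
    }
    return min(distances, key=distances.get), distances

# Usage (templates built beforehand from recordings you have collected):
#   label, dists = nearest_model(user_mfcc, {"ideal": ideal_avg,
#                                            "vowel_too_short": short_vowel_avg})
```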
Even words spoken with a good accent will be, on average, some "distance" from the ideal. You'll have to take that into account and compare the user's distance to that baseline "good" distance rather than to zero.
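For instance, you could express the user's distance relative to the spread you observe among native speakers themselves; a minimal sketch, assuming you have already computed those native-speaker distances:

```python
import numpy as np

def relative_score(user_distance, native_distances):
    """Express the user's distance in units of the native speakers' own spread:
    around 0 means as close as a typical native speaker, larger is worse."""
    baseline = np.mean(native_distances)
    spread = np.std(native_distances) + 1e-9   # avoid division by zero
    return (user_distance - baseline) / spread

# e.g. native_distances = [np.linalg.norm(m - ideal_avg) for m in native_mfccs]
```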
Correlation might not be the best way to compare the relative similarity of two sounds. Experiment with lots of different metrics: try different L^p norms (http://en.wikipedia.org/wiki/Lp_space), or try weighting the individual MFCCs differently (if I recall correctly, even though the coefficients are all supposed to carry the same perceptual "weight", the middle ones still matter more to how we perceive a sound than the very high or low ones).
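A sketch of a weighted L^p distance between two aligned MFCC matrices; the example weights that emphasise the middle coefficients are purely illustrative, not tuned values:

```python
import numpy as np

def weighted_lp_distance(a, b, p=2, weights=None):
    """Weighted L^p distance between two MFCC matrices of identical shape.
    weights has one entry per MFCC coefficient (i.e. per row)."""
    if weights is None:
        weights = np.ones(a.shape[0])
    diff = np.abs(a - b) * weights[:, None]   # scale each coefficient row
    return np.sum(diff ** p) ** (1.0 / p)

# Illustrative weights for 13 coefficients, emphasising the middle ones:
example_weights = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 1.5, 1.5,
                            1.2, 1.0, 0.8, 0.6, 0.5, 0.4])
# d = weighted_lp_distance(user_mfcc, native_mfcc, p=2, weights=example_weights)
```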
There might be certain parts of the sound where the pronunciation matters much more for the quality of the accent. Perhaps using transient detection to find those positions and mark them as more important would help. If you had a whole bunch of "good pronunciation" and "bad pronunciation" examples, you could probably extract those locations automatically.
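If you want to experiment with that, one rough approach is to build per-frame weights from librosa's onset-strength envelope so transient regions count more (the boost factor here is an arbitrary placeholder):

```python
import numpy as np
import librosa

def transient_weights(path, n_frames, boost=3.0):
    """Per-frame weights that emphasise transient (onset-heavy) regions."""
    y, sr = librosa.load(path, sr=16000)
    env = librosa.onset.onset_strength(y=y, sr=sr)
    # Resample the envelope so it has one value per MFCC frame.
    env = np.interp(np.linspace(0, len(env) - 1, n_frames),
                    np.arange(len(env)), env)
    env = env / (env.max() + 1e-9)      # normalise to [0, 1]
    return 1.0 + boost * env            # quiet frames weigh 1, transients up to 1 + boost

# Multiply the per-frame difference by these weights before summing, e.g.:
#   w = transient_weights("native.wav", user.shape[1])
#   d = np.sum((np.abs(user - native) * w) ** 2)
```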
Again, in the end the only way you're going to know which combination of these options works best is by testing.
I've read about adapting Gaussian mixture models trained on the phonetic space of a general speaker population to an individual speaker. This might be useful for training against a non-canonical accent for private use.
If you just compare the speaker to a general pronunciation model, the match might not be very good, so the idea is to adapt the models during individual training so they fit that speaker better.
Speaker Verification using Adapted Gaussian Mixture Models
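A rough sketch of the mean-only MAP adaptation used in that line of work, treating scikit-learn's GaussianMixture as the general ("universal background") model; the relevance factor and component count are just conventional placeholder values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adapt_means(ubm, speaker_frames, relevance=16.0):
    """Mean-only MAP adaptation: nudge the general model's component means
    toward the statistics of one speaker's MFCC frames.
    ubm: a fitted GaussianMixture; speaker_frames: (n_frames, n_features)."""
    resp = ubm.predict_proba(speaker_frames)          # (n_frames, n_components)
    n_k = resp.sum(axis=0)                            # soft counts per component
    first_moment = resp.T @ speaker_frames            # (n_components, n_features)
    speaker_means = first_moment / (n_k[:, None] + 1e-9)
    alpha = (n_k / (n_k + relevance))[:, None]        # adaptation weight per component

    adapted = GaussianMixture(n_components=ubm.n_components,
                              covariance_type=ubm.covariance_type)
    # Reuse the general model's weights and covariances, adapt only the means.
    adapted.weights_ = ubm.weights_
    adapted.covariances_ = ubm.covariances_
    adapted.precisions_cholesky_ = ubm.precisions_cholesky_
    adapted.means_ = alpha * speaker_means + (1 - alpha) * ubm.means_
    return adapted

# ubm = GaussianMixture(n_components=32, covariance_type="diag").fit(all_native_frames)
# speaker_model = adapt_means(ubm, this_speaker_frames)
```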
EDIT: Looking over your question again, I think I answered a slightly different question, but the technique uses similar models.