
How do I compare word pronunciation?

This is for a personal project of mine, and I have no idea where to start, as it falls way beyond my comfort zone.

I know that there are a few language-learning programs out there that allow the user to record his or her voice and compare the pronunciation with that of a native speaker of said language.

My question is: how do I achieve this?

I mean, how does one compare the pronunciation of the user with that of the native speaker?

asked by Paulo Santos

2 Answers

If you're looking for something relatively simple, you could compute the MFCCs (http://en.wikipedia.org/wiki/Mel-frequency_cepstrum) of the recording, and then look at something straightforward like the correlation between the recording and the average coefficients of that word as pronounced by a native speaker. The MFCC transforms the audio into a space where Euclidean distance corresponds more closely to perceptual difference.
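A minimal sketch of that idea in Python, assuming the librosa and NumPy libraries; the file names "user.wav" and "native.wav" are placeholders:

```python
import librosa
import numpy as np

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load a recording and return its MFCC matrix (n_mfcc x frames)."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

user = mfcc_features("user.wav")      # placeholder: the learner's recording
native = mfcc_features("native.wav")  # placeholder: the native speaker's

# Crudely truncate to the shorter recording so the frame counts match;
# proper alignment is discussed in point 1 below.
n = min(user.shape[1], native.shape[1])
score = np.corrcoef(user[:, :n].ravel(), native[:, :n].ravel())[0, 1]
print(f"similarity (correlation): {score:.3f}")
```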

Of course, there are several possible problems:

  1. Aligning the two recordings so the coefficients match up. To fix this, you could look at the maximum cross-correlation of the coefficients, rather than the simple correlation, so you get an automatic "best alignment" for free (see the first sketch after this list). Also, you might have to clip off the ends of the recording, so that only the actual pronunciation of the word remains.

  2. The MFCC maps to perceptual space, but might not correspond so well to accent inaccuracies. You could perhaps fix this by comparing the recording not just to the "ideal" pronunciation, but also to the averages for several different types of mispronunciation, and seeing which model it is closest to.

  3. Even well-pronounced words will be, on average, some "distance" from the ideal. You'll have to take that into account and compare the input's distance against that typical "good" distance.

  4. Correlation might not be the best way to compare the relative similarity of two sounds. Experiment with lots of different metrics: try different L^p norms (http://en.wikipedia.org/wiki/Lp_space), or try weighting the different MFCCs differently (if I recall correctly, even after the MFCCs have been taken, although they are all supposed to have the same perceptual "weight", the ones in the middle are still more important to how we perceive a sound than the high or low ones). A second sketch below shows the idea.

  5. There might be certain parts of the sound where the pronunciation matters much more for the quality of the accent. Perhaps using transient detection to find those positions and mark them as more important would help. If you had a whole bunch of "good pronunciation" and "bad pronunciation" examples, you could probably extract those locations automatically.
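Here is a sketch of the "maximum cross-correlation" alignment from point 1, reusing the `user` and `native` MFCC matrices from the earlier snippet; the maximum lag of 50 frames is an arbitrary choice:

```python
import numpy as np

def best_lag_score(a, b, max_lag=50, min_overlap=10):
    """Slide one MFCC matrix against the other (in frames) and return
    the highest correlation found and the lag at which it occurs."""
    best_r, best_lag = -1.0, 0
    for lag in range(-max_lag, max_lag + 1):
        # Positive lag: drop the first `lag` frames of a; negative: of b.
        x = a[:, lag:] if lag >= 0 else a
        y = b if lag >= 0 else b[:, -lag:]
        m = min(x.shape[1], y.shape[1])
        if m < min_overlap:  # too little overlap to be meaningful
            continue
        r = np.corrcoef(x[:, :m].ravel(), y[:, :m].ravel())[0, 1]
        if r > best_r:
            best_r, best_lag = r, lag
    return best_r, best_lag

score, lag = best_lag_score(user, native)
print(f"best alignment at lag {lag} frames, correlation {score:.3f}")
```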

Again, in the end the only way you're going to know which combination of these options works best is by testing.
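And a sketch of the metric experiments from point 4, assuming `user` and `native` have already been aligned and truncated to the same shape (e.g. via the snippets above); the per-coefficient weights are illustrative guesses, not calibrated values:

```python
import numpy as np

def lp_distance(a, b, p=2, weights=None):
    """Weighted L^p distance between two equal-shape MFCC matrices."""
    diff = np.abs(a - b)
    if weights is not None:
        diff = diff * weights[:, None]  # one weight per coefficient row
    return float((diff ** p).sum() ** (1.0 / p))

w = np.ones(13)
w[4:9] = 2.0  # guess: emphasize mid-range coefficients, tune on real data

for p in (1, 2, 3):
    print(f"L^{p} distance: {lp_distance(user, native, p=p, weights=w):.2f}")
```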

answered by Jeremy Salwen


I've read about adapting Gaussian mixture models (GMMs) for the phonetic space of a general speaker to an individual. This might be useful for training against a non-canonical accent for private use.

If you just compare the speaker to a general pronunciation model, the match might not be very good. So the idea is to adjust the models during individual training so that they fit the speaker better.

Speaker Verification using Adapted Gaussian Mixture Models
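As a rough illustration of the mean-adaptation step described in that paper, here is a sketch assuming a scikit-learn GaussianMixture fitted on general data serves as the background model; `speaker_feats` is a placeholder (n_frames x n_mfcc) array of the individual's MFCC frames, and the relevance factor of 16 is a commonly quoted but illustrative value:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adapt_means(ubm: GaussianMixture, speaker_feats, r=16.0):
    """MAP-adapt the mixture means toward the individual speaker's data,
    leaving weights and covariances untouched for simplicity."""
    resp = ubm.predict_proba(speaker_feats)   # (frames x components)
    n_k = resp.sum(axis=0)                    # soft frame counts per component
    # Each component's weighted mean of the speaker's frames.
    x_bar = (resp.T @ speaker_feats) / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + r))[:, None]        # adaptation coefficients
    return alpha * x_bar + (1.0 - alpha) * ubm.means_
```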

EDIT: Looking over your question again, I think I answered a different question, but the technique uses similar models:

  1. Model the various languages (do you have lots of data for different languages? Collecting the data might be the hard part). GMMs work well for this.
  2. Compare the speaker's data points to the various language models.
  3. Choose the model that best predicts the speaker's data as the winner (a sketch follows below).
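A minimal sketch of those three steps with scikit-learn; `features_by_language` is a placeholder mapping each language name to an (n_frames x n_mfcc) array of MFCC frames pooled from its recordings:

```python
from sklearn.mixture import GaussianMixture

def train_models(features_by_language, n_components=16):
    """Step 1: fit one GMM per language on its pooled MFCC frames."""
    models = {}
    for lang, feats in features_by_language.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        models[lang] = gmm.fit(feats)
    return models

def best_language(models, speaker_feats):
    """Steps 2-3: score the speaker's frames under each model and pick
    the language whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda lang: models[lang].score(speaker_feats))
```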
answered by Atreys