
How to detect how similar a speech recording is to another speech recording?

I would like to build a program to detect how close a user's audio recording is to another recording in order to correct the user's pronunciation. For example:

  1. I record myself saying "Good morning"
  2. I let a foreign student record "Good morning"
  3. Compare his recording to mine to see if his pronunciation was good enough.

I've seen this in some language learning tools (I believe Rosetta Stone does this), but how is it done? Note we're only dealing with speech (and not, say, music). What are some algorithms or libraries I should look into?

asked Jun 09 '13 by foobar



1 Answer

A lot of people seem to be suggesting some sort of edit distance, which IMO is a totally wrong approach for determining the similarity of two speech patterns, especially for patterns as short as the OP is implying. The specific algorithms used in speech recognition are in fact nearly the opposite of what you want here. The problem in speech recognition is resolving many similar pronunciations to the same representation; the problem here is to take a number of slightly different pronunciations and get some kind of meaningful distance between them.

I've done quite a bit of this for large-scale data science, and while I can't comment on exactly how proprietary programs do it, I can comment on how it's done in academia and provide a solution that is straightforward and gives you the power and flexibility you want for this approach.

Firstly: assume that what you have is a chunk of audio with no filtering done on it, just as it would be acquired from a microphone. The first step is to eliminate background noise. There are a number of different methods for this, but I'm going to assume that what you want is something that works well without being incredibly difficult to implement.

  • Filter the audio using scipy's filtering module (scipy.signal). There are a lot of frequencies that microphones pick up that are simply not useful for categorizing speech. I would suggest either a Bessel or a Butterworth filter to ensure that your waveform is preserved through filtering. Most of the information in everyday speech falls roughly between 300 and 3400 Hz, so a reasonable band-pass would be something like 300 to 4000 Hz, just to make sure you don't lose anything.
  • Look for the least active portion of speech and assume that it is a reasonable representation of the background noise. At this point you're going to want to run a series of Fourier transforms along your data (or generate a spectrogram) and find the part of your speech recording that has the lowest average frequency response. Once you have that snapshot, subtract it from all other points in your audio sample.
  • At this point you should have an audio file that is mostly just your user's speech, ready to be compared to another file that has gone through the same process. Now we want to actually clip the sound and compare this clip to some master clip. (A short sketch of these preprocessing steps follows this list.)
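To make that first part concrete, here's a minimal sketch of the preprocessing under a few assumptions of my own: a mono WAV file, scipy.signal for both the Butterworth band-pass and the quietest-frame noise estimate, and a `preprocess` function name that is purely illustrative.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt, stft, istft

def preprocess(path, low_hz=300.0, high_hz=4000.0):
    """Band-pass a recording and subtract an estimated noise floor."""
    rate, audio = wavfile.read(path)
    audio = audio.astype(np.float64)
    if audio.ndim > 1:                                   # mix stereo down to mono
        audio = audio.mean(axis=1)

    # 1. Butterworth band-pass; filtfilt runs the filter forwards and backwards,
    #    so it is zero-phase and the waveform shape is preserved.
    nyquist = rate / 2.0
    b, a = butter(4, [low_hz / nyquist, high_hz / nyquist], btype="band")
    audio = filtfilt(b, a, audio)

    # 2. Estimate the noise floor from the quietest STFT frame and subtract its
    #    magnitude from every frame (a crude spectral subtraction).
    freqs, times, spec = stft(audio, fs=rate, nperseg=1024)
    magnitude, phase = np.abs(spec), np.angle(spec)
    quietest = magnitude.sum(axis=0).argmin()            # index of the quietest frame
    cleaned = np.maximum(magnitude - magnitude[:, [quietest]], 0.0)

    # Rebuild a time-domain signal, reusing the original phase.
    _, denoised = istft(cleaned * np.exp(1j * phase), fs=rate, nperseg=1024)
    return rate, denoised
```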

Secondly: you're going to want to come up with a distance metric between two speech patterns. There are a number of ways to do this, but I'm going to assume we have the output of part one and some master file that has been through similar processing.

  • Generate a spectrogram of the audio file in question. The output from this is ultimately going to be an image that can be represented as a 2-D array of frequency response values. A spectrogram is essentially a Fourier transform over time, where the colour corresponds to intensity.

  • Use OpenCV (which has Python bindings) to run blob detection on your spectrogram. Effectively this is going to look for the big colourful blob in the middle of your spectrogram and give you its limits. What this should return is a significantly sparser version of the original 2-D array that represents only the speech in question (assuming your audio file has some trailing silence or noise at the front and back of the recording). A rough sketch of this step follows the list.

  • Normalize the two blobs to account for differences in speech speed. Everyone talks at a different speed, so your blobs will probably have different sizes along the x-axis (time); without this step, speaking speed alone would register as a difference between the recordings, which you probably don't want. You can skip this if you also want to make sure the student speaks at the same speed as the master copy, but I would suggest doing it. Basically, you want to stretch out the shorter version by multiplying its time axis by a constant that's just the ratio of the lengths of your two blobs.

  • You should also normalize the two blobs based on maximum and minimum intensity to account for people who talk at different volumes. Again, this is at your discretion, but to do it you should rescale each array so that the total span of intensities and the maximum intensity match up between your two 2-D arrays. (This normalization is also sketched after the list.)
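Here's a rough sketch of the spectrogram and blob-cropping steps. It assumes OpenCV 4.x (where cv2.findContours returns two values) and uses a simple mean-intensity threshold to define the blob; the `speech_blob` name and the threshold choice are mine, not part of any prescribed recipe.

```python
import cv2
import numpy as np
from scipy.signal import spectrogram

def speech_blob(rate, audio):
    """Crop a spectrogram down to the high-energy region that contains the speech."""
    freqs, times, spec = spectrogram(audio, fs=rate, nperseg=1024)
    spec_db = 10.0 * np.log10(spec + 1e-10)              # intensity in dB

    # Rescale to 0-255 so OpenCV can treat the spectrogram like a grayscale image.
    img = cv2.normalize(spec_db, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Everything brighter than the mean is treated as part of the speech "blob".
    _, mask = cv2.threshold(img, int(img.mean()), 255, cv2.THRESH_BINARY)

    # Take the largest connected region and crop the spectrogram to its bounding box.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return spec_db[y:y + h, x:x + w]
```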
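And a similarly hedged sketch of the speed and volume normalization. Stretching the time axis with cv2.resize is just one convenient choice (any interpolation routine would do), and resizing both blobs to a common shape also keeps the later element-wise comparison straightforward.

```python
import cv2
import numpy as np

def normalize_blobs(blob_a, blob_b):
    """Stretch both blobs to a common shape and rescale their intensities to [0, 1]."""
    # Columns are time, rows are frequency; stretch both blobs to the larger size
    # so speaking speed (and any difference in cropped frequency range) drops out.
    width = max(blob_a.shape[1], blob_b.shape[1])
    height = max(blob_a.shape[0], blob_b.shape[0])

    def _norm(blob):
        stretched = cv2.resize(blob.astype(np.float32), (width, height),
                               interpolation=cv2.INTER_LINEAR)
        # Min-max normalization removes differences in overall speaking volume.
        span = stretched.max() - stretched.min()
        return (stretched - stretched.min()) / (span + 1e-10)

    return _norm(blob_a), _norm(blob_b)
```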

Thirdly: now that you have 2-D arrays representing your two speech events, which should in theory contain all of their useful information, it's time to compare them directly. Luckily, comparing two matrices is a well-solved problem and there are a number of ways to move forward.

  • Personally, I would suggest using a metric like cosine similarity to determine the difference between your two blobs. That's not the only solution, and while it'll give you a quick validation, you can do better.

  • You could try subtracting one matrix from the other to get a measure of how much they differ, which would probably be more accurate than simple cosine distance.

  • It might be overkill, but you could assume that certain regions of speech matter more or less for evaluating the difference between blobs (it might not matter if someone uses a long "i" instead of a short "i", but a "g" instead of a "k" could be a different word entirely). For something like that you'd want to develop a mask for the difference array from the previous step and multiply all your values by it.

  • Whichever method you choose, you can now simply set some difference threshold and make sure that the difference between the two blobs is below it. If it is, the captured speech is similar enough to be correct; otherwise, have the student try again. (See the sketch after this list for one way to put these pieces together.)
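Here's one hedged way to put the comparison together: cosine similarity on the flattened blobs, an optional weighted element-wise difference, and a pass/fail threshold. The threshold value and the weighting scheme are illustrative assumptions you'd tune on real recordings.

```python
import numpy as np

def compare_blobs(blob_a, blob_b, weight_mask=None, max_mean_diff=0.15):
    """Compare two normalized, same-shape spectrogram blobs."""
    a, b = blob_a.ravel(), blob_b.ravel()

    # Cosine similarity: 1.0 means the blobs point in the same direction,
    # values near 0.0 mean they are essentially unrelated.
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

    # Element-wise difference, optionally weighted so regions of the spectrogram
    # that matter more for intelligibility count more heavily.
    diff = np.abs(blob_a - blob_b)
    if weight_mask is not None:
        diff = diff * weight_mask
    mean_diff = float(diff.mean())

    # Simple illustrative decision rule: accept if the average difference is small.
    return cosine, mean_diff, mean_diff < max_mean_diff
```

Chaining these sketches together (preprocess → speech_blob → normalize_blobs → compare_blobs) gives you a rough end-to-end version of the pipeline described above.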

I hope that's helpful. Again, I can't assure you that this is the exact algorithm a given company uses, since that information is hugely proprietary and not open to the public, but I can assure you that similar methods are used in the very best academic papers and that they will give you a good balance of accuracy and ease of implementation. Let me know if you have any questions, and good luck with your future data science exploits!

answered Oct 14 '22 by Slater Victoroff