First of all, I'd like to state that my question is not, per se, about the "classic" definition of voice recognition.
What we are trying to do is somewhat different. For example, I record a voice command for calling my mom: I select her contact and say "Mom". Later, when I use the program and say "Mom", it automatically calls her.
How would I perform the comparison of a spoken command to a saved voice sample?
EDIT: We have no need for any speech-to-text abilities, solely a comparison of sound signals. Obviously we're looking for some sort of off-the-shelf product or framework.
One way this is done for music recognition is to take a time sequence of frequency spectra for the two sounds in question (a short-time Fourier transform, i.e., a series of time-windowed FFTs), map the locations of the frequency peaks over the time axis, and cross-correlate the two 2D time-frequency peak maps to look for a match (see the sketch below). This is far more robust than just cross-correlating the two raw sound samples, because the peaks change far less than all the spectral "cruft" between them. The method works best when the rate and pitch of the two utterances haven't changed too much.
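As a rough illustration of the matching step (my own sketch, not any library's API): take the strongest bins from each spectral frame, then slide one peak map over the other in time and count coincident peaks, a simple stand-in for the full 2D cross-correlation. The peak count and lag range here are arbitrary assumptions you'd want to tune:

```swift
import Foundation

/// A peak map: for each STFT frame, the indices of the strongest frequency bins.
typealias PeakMap = [Set<Int>]

/// Picks the `count` largest-magnitude bins in one spectral frame.
func peakBins(in spectrum: [Float], count: Int = 5) -> Set<Int> {
    let ranked = spectrum.enumerated().sorted { $0.element > $1.element }
    return Set(ranked.prefix(count).map { $0.offset })
}

/// Slides one peak map over the other in time and counts coincident
/// (frame, bin) peaks; returns the best overlap normalized to 0...1.
func matchScore(_ a: PeakMap, _ b: PeakMap, maxLag: Int = 10) -> Float {
    var best = 0
    for lag in -maxLag...maxLag {
        var hits = 0
        for t in 0..<a.count {
            let u = t + lag
            guard u >= 0 && u < b.count else { continue }
            hits += a[t].intersection(b[u]).count
        }
        best = max(best, hits)
    }
    let denom = a.reduce(0) { $0 + $1.count }
    return denom == 0 ? 0 : Float(best) / Float(denom)
}

// Usage: build a peak map from each recording's spectral frames, then compare:
// let mapA = spectraA.map { peakBins(in: $0) }
// let mapB = spectraB.map { peakBins(in: $0) }
// let score = matchScore(mapA, mapB)   // accept the command if score exceeds a threshold
```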
In iOS 4.x, you can use the Accelerate framework for the FFTs and possibly the 2D cross-correlations as well.
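For the per-frame spectra, a minimal vDSP sketch follows, written in modern Swift for readability; on iOS 4.x you'd call the same vDSP C functions (vDSP_create_fftsetup, vDSP_ctoz, vDSP_fft_zrip, vDSP_zvmags) from Objective-C. Slice the recording into overlapping frames and apply a window (e.g., vDSP_hann_window) to each frame before calling this:

```swift
import Accelerate

/// Squared-magnitude spectrum of one windowed frame.
/// `frame.count` must be a power of two.
func magnitudeSpectrum(_ frame: [Float]) -> [Float] {
    let n = frame.count
    let log2n = vDSP_Length(log2(Float(n)))
    guard let setup = vDSP_create_fftsetup(log2n, FFTRadix(kFFTRadix2)) else { return [] }
    defer { vDSP_destroy_fftsetup(setup) }

    var realp = [Float](repeating: 0, count: n / 2)
    var imagp = [Float](repeating: 0, count: n / 2)
    var mags  = [Float](repeating: 0, count: n / 2)

    realp.withUnsafeMutableBufferPointer { re in
        imagp.withUnsafeMutableBufferPointer { im in
            var split = DSPSplitComplex(realp: re.baseAddress!, imagp: im.baseAddress!)
            // Reinterpret the real input as interleaved complex and split it.
            frame.withUnsafeBufferPointer { buf in
                buf.baseAddress!.withMemoryRebound(to: DSPComplex.self, capacity: n / 2) {
                    vDSP_ctoz($0, 2, &split, 1, vDSP_Length(n / 2))
                }
            }
            // In-place forward real FFT.
            vDSP_fft_zrip(setup, &split, 1, log2n, FFTDirection(kFFTDirection_Forward))
            // Squared magnitude per frequency bin.
            vDSP_zvmags(&split, 1, &mags, 1, vDSP_Length(n / 2))
        }
    }
    return mags
}
```

In production you'd create the FFT setup once and reuse it across frames rather than rebuilding it per call.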