Given two recorded voices in digital format, is there an algorithm to compare the two and return a coefficient of similarity?
In one of the works [10], speech pre-processing method was considered using the VAD algorithm, which proves that this algorithm improves the performance of speech recognition.
The process usually involves at least one of the following tasks: transcribing a recorded voice, comparing an intercepted voice to that of a suspect, putting the suspect's voice in a lineup of different voices, profiling a speaker based on dialect or language spoken, interpreting noises or verifying the authenticity of ...
The model checks and rechecks all the probabilities to come up with the most likely text that was spoken. You can see this in real-time when you dictate into your phone's assistant. You may notice that the words at the beginning of your phrase start changing as the system tries to understand what you say.
The ML algorithm establishes the connection between phonemes and sounds, giving them accurate intonations. The system uses a sound wave generator to create a vocal sound. The frequency characteristics of phrases obtained from the acoustic model are eventually loaded into the sound wave generator.
I recommend to take a look into the HTK toolkit for speech recognition http://htk.eng.cam.ac.uk/, especially the part on feature extraction.
Features that I would assume to be good indicators:
Given your clarification I think what you are looking for falls under speech recognition algorithms.
Even though you are only looking for the measure of similarity and not trying to turn speech into text, still the concepts are the same and I would not be surprised if a large part of the algorithms would be quite useful.
However, you will have to define this coefficient of similarity more formally and precisely to get anywhere.
EDIT: I believe speech recognition algorithms would be useful because they do abstraction of the sound and comparison to some known forms. Conceptually this might not be that different from taking two recordings, abstracting them and comparing them.
From wikipedia article on HMM
"In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients."
So if you run such an algorithm on both recordings you would end up with coefficients that represent the recordings and it might be far easier to measure and establish similarities between the two.
But again now you come to the question of defining the 'similarity coefficient' and introducing dogs and horses did not really help.
(Well it does a bit, but in terms of evaluating algorithms and choosing one over another, you will have to do better).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With