I'm having trouble determining from this research paper exactly how I can reproduce the Standard Vector Quantization algorithm to determine the language of an unidentified speech input, based on a training set of data. Here's some basic info:
Abstract
Language recognition (e.g. Japanese, English, German, etc.) using acoustic features is an important yet difficult problem for current speech technology. ... The speech database used in this paper contains 20 languages: 16 sentences uttered twice by 4 males and 4 females. The duration of each sentence is about 8 seconds. The first algorithm is based on the standard Vector Quantization (VQ) technique. Every language is characterized by its own VQ codebook.
Recognition Algorithms
The first algorithm is based on the standard Vector Quantization (VQ) technique. Every language, $k$, is characterized by its own VQ codebook, $V_k$. In the recognition stage, the input speech is quantized by $V_k$ and the accumulated quantization distortion, $d_k$, is calculated. The language which has the minimal distortion is recognized. For calculating the VQ distortion, several LPC spectral distortion measures are applied ... in this case, the WLR (weighted likelihood ratio) distance.
Standard VQ Algorithm: A codebook, $V_k$, for each language is generated using training sentences. The accumulated distance for the input vectors $x_1, \dots, x_N$ of a sentence is defined as:

$$d_k = \sum_{i=1}^{N} \min_{1 \le j \le M} d(x_i, v_{k,j})$$

where $v_{k,j}$ is the $j$-th of the $M$ entries of codebook $V_k$. The distance $d$ can be any distance which corresponds to the acoustic features, and it must be the same as the one used for codebook generation. Each language is characterized by its VQ codebook, $V_k$.
My question is: how exactly do I do this? I have a set of 50 sentences in English. In MATLAB, I can easily calculate the WLR for any given signal. But how do I build a codebook, since I must use the WLR for "codebook generation" for English? I'm also curious how to compare a VQ codebook of size 16 (which was found to be the best size) to a given input signal. If anyone could help distill this paper down for me, I'd appreciate it greatly.
Thanks!
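The second question (comparing a codebook to a given signal) is easier: split the input signal into frames, each giving one feature vector, and for each input vector calculate the distance d to every codebook entry V_k_j. The j with the smallest distance d corresponds to the best-fitting codebook entry. Summing these smallest distances over all input vectors gives the accumulated distortion d_k for language k, and the language with the smallest d_k is the recognized one. As the distance function you can use the WLR.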
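A minimal NumPy sketch of this recognition step, assuming each frame of the input sentence has already been turned into a feature vector (one row of `frames`) and each language's codebook is an `(M, D)` array. The squared Euclidean distance is only a placeholder for the WLR distance; the paper requires the same distance that was used to build the codebooks, so you would swap in your own WLR function. All function names here are hypothetical.

```python
import numpy as np

def squared_euclidean(x, codebook):
    """Placeholder for WLR: distance from vector x to every codebook entry."""
    return np.sum((codebook - x) ** 2, axis=1)

def accumulated_distortion(frames, codebook, dist=squared_euclidean):
    """d_k: sum over input vectors of the distance to the nearest codebook entry."""
    return sum(float(np.min(dist(x, codebook))) for x in frames)

def recognize(frames, codebooks, dist=squared_euclidean):
    """Return the language whose codebook gives the smallest accumulated distortion."""
    distortions = {lang: accumulated_distortion(frames, cb, dist)
                   for lang, cb in codebooks.items()}
    return min(distortions, key=distortions.get), distortions

# Toy example with 2-D "features" and 2-entry codebooks (made-up numbers).
codebooks = {
    "english": np.array([[0.0, 0.0], [1.0, 1.0]]),
    "japanese": np.array([[5.0, 5.0], [6.0, 6.0]]),
}
frames = np.array([[0.1, 0.0], [0.9, 1.1], [0.2, 0.1]])
best, scores = recognize(frames, codebooks)
print(best)  # english
```

Because every language is scored on the same input frames, the raw sums can be compared directly without normalizing by sentence length.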
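Building the codebook (training) is a bit more complicated. You must split your sentences into short frames, compute a feature vector for each frame, and then use some clustering algorithm (like k-means) to group these vectors into 16 clusters, one per codebook entry. Then find the mean of every cluster; this mean will be a codebook entry. As the paper says, the distance used during clustering should be the same one (WLR) that you use for recognition. This is the first thing that comes to mind.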
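A minimal sketch of that clustering approach using plain k-means (Lloyd's algorithm) in NumPy with 16 codebook entries. Assumptions: `features` is an `(n_frames, D)` array of per-frame feature vectors pooled from all English training sentences, and squared Euclidean distance with an arithmetic-mean centroid stands in for WLR (with WLR, the cluster mean is not necessarily the best representative). `train_codebook` is a hypothetical name, and the random data below only stands in for real LPC-derived features.

```python
import numpy as np

def train_codebook(features, size=16, iters=50, seed=0):
    """Build a VQ codebook of `size` entries with k-means (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialise with `size` training vectors picked at random (without replacement).
    codebook = features[rng.choice(len(features), size=size, replace=False)].copy()
    for _ in range(iters):
        # Assignment step: index of the nearest codebook entry for every vector.
        d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each entry to the mean of the vectors assigned to it.
        new_codebook = codebook.copy()
        for j in range(size):
            members = features[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous entry
                new_codebook[j] = members.mean(axis=0)
        if np.allclose(new_codebook, codebook):
            break  # converged
        codebook = new_codebook
    return codebook

rng = np.random.default_rng(1)
features = rng.normal(size=(500, 12))  # stand-in: 12 coefficients per frame
english_codebook = train_codebook(features, size=16)
print(english_codebook.shape)  # (16, 12)
```

Running this once per language, on that language's training sentences only, gives one codebook per language as the paper describes.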
Another algorithm (which I believe will work better) can be found here. Two simple training algorithms are also described on Wikipedia.
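For comparison, here is a sketch of a simple online training scheme of the kind described in the Wikipedia vector quantization article: repeatedly pick a random training vector and move the nearest codebook entry a small fraction of the way toward it. The learning rate, step count, and function name are hypothetical choices, and squared Euclidean distance again stands in for WLR.

```python
import numpy as np

def train_codebook_online(features, size=16, steps=20000, rate=0.05, seed=0):
    """Online VQ training: nudge the nearest codebook entry toward random samples."""
    rng = np.random.default_rng(seed)
    # Start from `size` training vectors picked at random (a float copy).
    codebook = features[rng.choice(len(features), size=size, replace=False)].astype(float)
    for _ in range(steps):
        x = features[rng.integers(len(features))]        # pick a sample at random
        j = np.argmin(((codebook - x) ** 2).sum(axis=1))  # nearest codebook entry
        codebook[j] += rate * (x - codebook[j])           # move it a small fraction toward x
    return codebook

rng = np.random.default_rng(1)
features = rng.normal(size=(500, 12))  # stand-in for per-frame feature vectors
online_codebook = train_codebook_online(features)
print(online_codebook.shape)  # (16, 12)
```

A fixed learning rate keeps the sketch short; decaying the rate over the course of training is a common refinement.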