Could someone please explain the difference between an i-vector and a d-vector? All I know is that both are widely used in speaker/speech recognition systems and that they serve as a kind of template for representing speaker information, but I don't know what the main differences are.
An i-vector is a feature that represents the idiosyncratic characteristics of a speaker through the distribution of its frame-level features. I-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the GMM supervector is never explicitly formed when computing the i-vector). It is extracted in a manner similar to the eigenvoice adaptation scheme or the JFA technique, but it is extracted per sentence (or input speech sample).
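Concretely, in the total variability model of Dehak et al. (the first paper below), the utterance-dependent GMM mean supervector M is modeled as

M = m + T w

where m is the speaker- and channel-independent UBM mean supervector, T is a low-rank total variability matrix learned from data, and w is a latent variable with a standard normal prior; the i-vector is the posterior mean of w given the utterance's features.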
A d-vector, on the other hand, is extracted using a DNN. To extract a d-vector, a DNN is trained that takes stacked filterbank features as input (similar to a DNN acoustic model used in ASR) and produces the one-hot speaker label (or the speaker posterior probabilities) at the output. The d-vector is the average of the last hidden layer's activations over the frames of an utterance. So unlike the i-vector framework, this makes no assumptions about the features' distribution (the i-vector framework assumes that the i-vector, i.e., the latent variable, has a Gaussian distribution).
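As a rough illustration, here is a minimal PyTorch-style sketch of d-vector extraction in the spirit of Variani et al. (the second paper below). The layer sizes, context width, and names are hypothetical, not taken from the paper; the softmax layer is used only during speaker-classification training and is discarded at enrollment/test time.

```python
import torch
import torch.nn as nn

# Hypothetical speaker-classification DNN: stacked filterbank frames in,
# speaker posteriors out. Sizes below are illustrative only.
N_SPEAKERS = 500
CONTEXT_DIM = 40 * 21          # 40 filterbank coefficients x 21 stacked frames

hidden = nn.Sequential(        # everything up to (and including) the last hidden layer
    nn.Linear(CONTEXT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
classifier = nn.Linear(256, N_SPEAKERS)   # softmax layer, used only for training

def extract_d_vector(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, CONTEXT_DIM) stacked filterbank features of one utterance."""
    with torch.no_grad():
        h = hidden(frames)                 # last hidden layer activations, per frame
    d = h.mean(dim=0)                      # average over frames -> utterance-level d-vector
    return d / d.norm()                    # length-normalize for cosine scoring
```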
In conclusion, these are two distinct features obtained from completely different methods and assumptions. I recommend reading these papers:
N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4080-4084.
I don't know how to properly characterize the d-vector in plain language, but I can help a little.
The identity vector, or i-vector, is a spectral signature for a particular slice of speech, usually a sliver of a phoneme, rarely (as far as I can see) as large as the entire phoneme. Basically, it's a discrete spectrogram expressed in a form isomorphic to the Gaussian mixture of the time slice.
EDIT
Thanks to those who provided comments and a superior answer. I updated this only to replace the incorrect information from my original attempt.
A d-vector is extracted from a deep neural network: it is the mean of the feature vectors (activations) from the DNN's final hidden layer. This becomes the model for the speaker and is used for comparison against other speech samples during identification.
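As a sketch of that comparison step (assuming d-vectors have already been extracted as NumPy arrays; the threshold value here is purely hypothetical):

```python
import numpy as np

def cosine_score(d_test: np.ndarray, d_enroll: np.ndarray) -> float:
    """Cosine similarity between a test d-vector and a speaker model."""
    return float(np.dot(d_test, d_enroll) /
                 (np.linalg.norm(d_test) * np.linalg.norm(d_enroll)))

def verify(test_dvec, enroll_dvecs, threshold=0.7):
    """enroll_dvecs: list of d-vectors from one speaker's enrollment utterances."""
    speaker_model = np.mean(enroll_dvecs, axis=0)   # speaker model = mean enrollment d-vector
    return cosine_score(test_dvec, speaker_model) >= threshold
```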