
Difference between i-vector and d-vector

Could someone please explain the difference between an i-vector and a d-vector? All I know is that both are widely used in speaker and speech recognition systems as a kind of template for representing speaker information, but I don't know the main differences between them.

Nikas Žalias asked May 29 '16 10:05


2 Answers

An i-vector is a feature that represents the idiosyncratic characteristics of how the frame-level features are distributed. I-vector extraction is essentially a dimensionality reduction of the GMM supervector (although the supervector itself is not explicitly extracted when computing the i-vector). It is extracted in a manner similar to eigenvoice adaptation or joint factor analysis (JFA), but per utterance, that is, per sentence or input speech sample.
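For reference, the model behind this is usually written as below. This is only a sketch in the notation of the Dehak et al. paper cited further down; the symbol names are assumptions as far as this answer is concerned.

```latex
M = m + T w, \qquad w \sim \mathcal{N}(0, I)
```

Here $M$ is the utterance-dependent GMM supervector, $m$ is the speaker- and channel-independent supervector taken from the universal background model, $T$ is the low-rank total variability matrix, and $w$ is the latent variable. The i-vector is the posterior mean of $w$ given the utterance, which is why it has far fewer dimensions than the supervector.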

A d-vector, on the other hand, is extracted using a DNN. First, a DNN is trained that takes stacked filterbank features as input (similar to the DNN acoustic model used in ASR) and predicts the speaker at the output, using one-hot speaker labels as training targets so that the output is a speaker probability. The d-vector is the activation of the last hidden layer of this DNN, averaged over the frames of the utterance. Unlike the i-vector framework, this makes no assumptions about the distribution of the features (the i-vector framework assumes that the i-vector, i.e. the latent variable, has a Gaussian distribution).
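The averaging step can be sketched roughly as follows. This is a toy illustration, not the setup from the paper: the layer sizes, the ReLU activation, the random (untrained) weights, and the function name `extract_d_vector` are all assumptions made for the example.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def extract_d_vector(frames, hidden_layers):
    """Average the last hidden layer's activations over all frames.

    frames: array of shape (num_frames, input_dim), e.g. stacked filterbanks.
    hidden_layers: list of (W, b) pairs for the hidden layers only; the
        speaker-classification output layer is not needed at extraction time.
    """
    h = frames
    for W, b in hidden_layers:
        h = relu(h @ W + b)   # frame-level activations of this layer
    return h.mean(axis=0)     # one fixed-size vector per utterance


# Hypothetical sizes: 40 filterbanks x 11 stacked frames, 256 hidden units.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 40 * 11, 256
hidden_layers = [
    (rng.normal(scale=0.1, size=(input_dim, hidden_dim)), np.zeros(hidden_dim)),
    (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)), np.zeros(hidden_dim)),
]

utterance = rng.normal(size=(300, input_dim))  # 300 frames of fake features
d_vector = extract_d_vector(utterance, hidden_layers)
print(d_vector.shape)  # (256,)
```

In a real system the weights would come from a network trained to classify speakers, and the utterance length can vary because the mean collapses the frame axis.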

In conclusion, these are two distinct features extracted with entirely different methods and assumptions. I recommend reading these papers:

N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.

E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4080-4084.

whkang answered Nov 04 '22 06:11


I don't know how to properly characterize the d-vector in plain language, but I can help a little.

The identity vector, or i-vector, is a spectral signature for a particular slice of speech, usually a sliver of a phoneme, rarely (as far as I can see) as large as the entire phoneme. Basically, it's a discrete spectrogram expressed in a form isomorphic to the Gaussian mixture of the time slice.

EDIT

Thanks to those who provided comments and a superior answer. I updated this only to replace the incorrect information from my original attempt.

A d-vector is extracted from a deep neural network (DNN): it is the mean of the feature vectors in the DNN's final hidden layer. This becomes the model for the speaker, which is compared against other speech samples for identification.
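As a rough illustration of the comparison step, here is a minimal sketch that scores a test d-vector against a speaker model with cosine similarity. Cosine scoring is one common choice rather than the only one, and the 3-dimensional vectors, the threshold value, and the function names are made up for the example.

```python
import numpy as np


def cosine_score(a, b):
    """Cosine similarity between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify(speaker_model, test_vector, threshold=0.5):
    """Accept the identity claim if the score reaches a (hypothetical) threshold."""
    return cosine_score(speaker_model, test_vector) >= threshold


# Toy d-vectors; real ones have hundreds of dimensions.
enrollment_d_vectors = np.array([[1.0, 0.0, 1.0],
                                 [0.8, 0.2, 1.2]])
speaker_model = enrollment_d_vectors.mean(axis=0)  # average over enrollment utterances

same_speaker = np.array([0.9, 0.1, 1.1])
other_speaker = np.array([-1.0, 1.0, 0.0])

print(verify(speaker_model, same_speaker))   # True
print(verify(speaker_model, other_speaker))  # False
```

In practice the threshold would be tuned on held-out data to trade off false accepts against false rejects.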

Prune answered Nov 04 '22 05:11