
CNN: initializing unknown words from word2vec

I came across the slides from Kim's presentation on CNNs that use word2vec: http://www.people.fas.harvard.edu/~yoonkim/data/Kim_EMNLP_2014_slides.pdf

On slide 20, the fourth bullet point reads:

Words not in word2vec are initialized randomly from U[−a, a] 
where a is chosen such that the unknown words have the
same variance as words already in word2vec.

How is "a" computed, and how is the full vector for a completely unknown word generated?

Thomas Kern asked May 06 '26 22:05


1 Answer

According to an answer by Mikolov himself, you can initialize the vector from the region of the embedding space occupied by the infrequent words: average the vectors of the infrequent words and use the result as the vector for the unknown token.
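A minimal sketch of that averaging idea, in plain Python. The function name, the `max_count` threshold, and the toy vectors and counts are all made up for illustration; how "infrequent" is defined is an assumption, not something the answer specifies.

```python
def unknown_vector_from_rare_words(vectors, counts, max_count=1):
    """Average the vectors of words seen at most `max_count` times.

    `vectors` maps word -> list of floats, `counts` maps word -> corpus count.
    The threshold is a hypothetical choice; tune it for your corpus.
    """
    rare = [vectors[w] for w, c in counts.items()
            if c <= max_count and w in vectors]
    if not rare:
        raise ValueError("no infrequent words found")
    dim = len(rare[0])
    # Component-wise mean of the rare-word vectors.
    return [sum(v[i] for v in rare) / len(rare) for i in range(dim)]


# Toy data, invented for illustration only.
vectors = {
    "the": [0.9, 0.1],
    "cat": [0.2, 0.4],
    "zyx": [0.0, 1.0],
    "qux": [1.0, 0.0],
}
counts = {"the": 1000, "cat": 50, "zyx": 1, "qux": 1}

unk = unknown_vector_from_rare_words(vectors, counts)
```

Note that this gives every unknown word the same vector, which is the point of a single unknown token.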

Following up on this idea, I think that a refers to the radius of the space occupied by the infrequent words. What you could do is compute the centroid C of the infrequent-word vectors (their mean), measure the diameter 2*a of that space Q, and generate a random vector u by sampling uniformly within Q.
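Here is a rough sketch of the centroid-and-radius idea, again in plain Python. It makes two assumptions that the answer does not settle: the "radius" a is taken as the largest per-coordinate deviation from the centroid, and Q is treated as an axis-aligned box around C rather than a ball. The helper names and toy vectors are hypothetical. The last function shows a more literal reading of the slide instead: since the variance of U[-a, a] is a^2 / 3, matching a target variance gives a = sqrt(3 * variance). Whether that variance should be pooled over all entries (as done here) or taken per dimension is not stated in the quoted bullet.

```python
import math
import random
from statistics import pvariance


def centroid(vecs):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def radius(vecs, center):
    """Assumed definition: largest per-coordinate deviation from the centroid."""
    return max(abs(x - c) for v in vecs for x, c in zip(v, center))


def sample_unknown(vecs, rng):
    """Draw each coordinate uniformly from [C_i - a, C_i + a]."""
    c = centroid(vecs)
    a = radius(vecs, c)
    return [rng.uniform(ci - a, ci + a) for ci in c]


def variance_matched_a(vecs):
    """Solve a**2 / 3 == variance for a, pooling all vector entries."""
    entries = [x for v in vecs for x in v]
    return math.sqrt(3.0 * pvariance(entries))


# Toy infrequent-word vectors, invented for illustration only.
rare_vectors = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

rng = random.Random(0)  # seeded so the run is repeatable
u = sample_unknown(rare_vectors, rng)
a = variance_matched_a(rare_vectors)
# For the slide's reading, an unknown word would then be drawn as
# [rng.uniform(-a, a) for _ in range(dim)], centered on zero.
```

With real embeddings you would pass in the pretrained vectors (or only the infrequent ones) in place of `rare_vectors`.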

Salvador Medina answered May 09 '26 03:05


