I came across these slides from Kim's EMNLP 2014 presentation on CNNs with word2vec: http://www.people.fas.harvard.edu/~yoonkim/data/Kim_EMNLP_2014_slides.pdf
On slide 20, the fourth bullet point reads:
> Words not in word2vec are initialized randomly from U[−a, a], where a is chosen such that the unknown words have the same variance as words already in word2vec.
Now I am wondering how "a" is computed, and also how the entire vector for an entirely unknown word is computed.
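One straightforward reading of the slide is a variance-matching argument: a uniform distribution U[−a, a] has variance a²/3, so setting a = sqrt(3 · var) makes the random vectors match the empirical variance of the pre-trained ones. A minimal sketch (the embedding matrix here is stand-in data, not Kim's actual vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained word2vec embedding matrix,
# shape (vocab_size, dim); in practice this is loaded from disk.
known_vectors = rng.normal(scale=0.1, size=(1000, 300))

# Var of U[-a, a] is a^2 / 3, so matching the empirical variance
# of the known vectors gives a = sqrt(3 * var).
var = known_vectors.var()
a = np.sqrt(3.0 * var)

# Initialize an unknown word uniformly in [-a, a] per dimension.
unk_vector = rng.uniform(-a, a, size=known_vectors.shape[1])
```

By construction the expected variance of `unk_vector` equals `var`, which is how I read "same variance as words already in word2vec".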
According to an answer by Mikolov himself, you can initialize the vector for an unknown word based on the space occupied by the infrequent words: he suggests averaging the vectors of the infrequent words to build the unknown token.
Following this idea, I think a refers to the radius of the infrequent-words space. What you could do is compute the centroid C of the infrequent words (as their mean), take the diameter 2a of the infrequent vector space Q, and generate a random vector u by sampling uniformly within Q.
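The centroid idea above could be sketched like this, treating Q as a ball of radius a around C (an assumption on my part; the vectors of the "infrequent words" are stand-in data here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the vectors of the infrequent words (assumed collected
# from the pre-trained model beforehand), shape (n_words, dim).
infrequent = rng.normal(size=(50, 300))

# Centroid C of the infrequent words (their mean).
C = infrequent.mean(axis=0)

# Radius a: largest distance from the centroid to any infrequent vector,
# so the diameter of Q is 2*a.
a = np.linalg.norm(infrequent - C, axis=1).max()

# Sample u uniformly inside the ball Q: pick a uniform random direction,
# then a radius r with density proportional to r^(d-1) so points are
# uniform in volume, not clustered near the centre.
d = infrequent.shape[1]
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
r = a * rng.random() ** (1.0 / d)
u = C + r * direction
```

Note that in high dimensions r ** (1/d) concentrates near a, so uniform samples from Q lie close to its surface; whether that is desirable for an unknown-word vector is a separate question.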