
Word2Vec - adding constraint to vector representation

I am trying to adapt the pre-trained Google News word2vec model to my specific domain. For the domain I am looking at, certain words are known to be similar to each other, so in an ideal world the word2vec representations of those words should reflect that. I understand that I can continue training the pre-trained model on a corpus of domain-specific data to update the vectors.
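For reference, the kind of continued training I have in mind looks roughly like this. This is only a sketch against gensim's 3.x API (intersect_word2vec_format and the size argument changed in later versions), and the file path and toy corpus are placeholders for my real data:

    from gensim.models import Word2Vec

    # Toy stand-in for my real domain corpus: an iterable of tokenized sentences.
    domain_sentences = [
        ["patient", "responded", "to", "treatment"],
        ["adjust", "the", "dose", "for", "the", "patient"],
    ]

    # size must match the 300-dimensional Google News vectors.
    model = Word2Vec(size=300, min_count=1)
    model.build_vocab(domain_sentences)

    # Seed overlapping vocabulary words with their pre-trained Google News vectors;
    # lockf=1.0 leaves them unlocked so continued training can still move them.
    model.intersect_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)

    model.train(domain_sentences, total_examples=model.corpus_count, epochs=5)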

However, if I know for certain that certain words are highly similar and should sit close together, is there a way for me to incorporate that constraint into the word2vec model? Mathematically, I would like to add a term to the loss function of word2vec that imposes a penalty if two words that I know to be similar are not positioned close to each other in the vector space. Does anyone have advice on how to implement this? Will it require me to unpack the word2vec model, or is there a way to add that extra term to the loss function directly?
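In symbols, what I have in mind is something like the following (my own ad-hoc notation, not taken from any particular implementation), where S is the set of known-similar word pairs, the v_i are the word vectors, and lambda controls the strength of the penalty:

    L = L_{\text{word2vec}} + \lambda \sum_{(i,j) \in S} \lVert v_i - v_j \rVert^2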

asked Oct 17 '22 by Ali

1 Answer

One approach is to take the pre-trained Google News word2vec vectors and run them through this "retrofitting" tool:

Faruqui, Manaal, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. "Retrofitting word vectors to semantic lexicons." arXiv preprint arXiv:1411.4166 (2014). https://arxiv.org/abs/1411.4166

This paper proposes a method for refining vector space representations using relational information from semantic lexicons by encouraging linked words to have similar vector representations, and it makes no assumptions about how the input vectors were constructed.
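Concretely, in the paper's notation: given pre-trained vectors \hat{q}_i and a lexicon graph whose edges E connect words believed to be similar, retrofitting finds new vectors Q minimizing

    \Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big]

which is exactly the trade-off you describe: stay close to the original embedding, but pay a penalty whenever linked words drift apart. The paper solves this with a cheap iterative update (each q_i is set to a weighted average of its neighbors and its original vector) rather than by re-training word2vec, so you never need to unpack the model or touch its loss function.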

The code is available at https://github.com/mfaruqui/retrofitting and is straightforward to use (I've personally used it for https://arxiv.org/abs/1607.02802).
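If I remember correctly, usage is a one-liner along the lines of the following (flag names are as I recall them from the repo's README, and all file names here are placeholders; the lexicon file lists, on each line, a word followed by the words it should be pulled towards):

    python retrofit.py -i my_vectors.txt -l my_lexicon.txt -n 10 -o my_retrofitted_vectors.txt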

answered Oct 21 '22 by Franck Dernoncourt