What is the best way to handle missing words when using word embeddings?

I have a set of pre-trained word2vec word vectors and a corpus. I want to use the word vectors to represent words in the corpus. The corpus has some words in it that I don't have trained word vectors for. What's the best way to handle those words for which there is no pre-trained vector?

I've heard several suggestions.

  1. use a vector of zeros for every missing word

  2. use a vector of random numbers for every missing word (with various suggestions on how to bound those random values)

  3. an idea I had: use a vector where each position holds the mean of that position across all the pre-trained vectors

Does anyone with experience with this problem have thoughts on how to handle it?
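To make the options concrete, here is a minimal sketch of the three fallbacks, assuming the pre-trained vectors are loaded with gensim's KeyedVectors (the file name is just a placeholder):

```python
import numpy as np
from gensim.models import KeyedVectors

# hypothetical file name; any word2vec-format vectors load the same way
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)
dim = wv.vector_size

# option 3: position-wise mean over every pre-trained vector
mean_vector = wv.vectors.mean(axis=0)

def embed(word, strategy="mean"):
    if word in wv.key_to_index:
        return wv[word]                             # word has a pre-trained vector
    if strategy == "zeros":                         # option 1
        return np.zeros(dim)
    if strategy == "random":                        # option 2, bounded small values
        return np.random.uniform(-0.25, 0.25, dim)
    return mean_vector                              # option 3
```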

asked Feb 09 '18 by TheSneak

2 Answers

FastText from Facebook builds word vectors from subword (character n-gram) embeddings, which allows it to compose vectors for out-of-vocabulary words. See more about this approach at: Out of Vocab Word Embedding
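For illustration, a minimal sketch using gensim's FastText implementation (assuming gensim 4.x; the toy sentences are just placeholders) showing that an out-of-vocabulary word still gets a vector composed from its character n-grams:

```python
from gensim.models import FastText

# toy corpus just for illustration
sentences = [["the", "quick", "brown", "fox"],
             ["jumped", "over", "the", "lazy", "dog"]]
model = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5, epochs=10)

# "foxes" was never seen during training, but FastText composes a vector
# for it from its character n-grams (e.g. "fox", "oxe", "xes")
print("foxes" in model.wv.key_to_index)  # False: out of vocabulary
print(model.wv["foxes"].shape)           # (50,) -- still returns a vector
```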

answered Oct 18 '22 by Adnan S


A pre-trained word2vec embedding matrix usually includes a special unknown token (often spelled unk) whose vector was trained to stand in for out-of-vocabulary words; looking that token up is often the best choice.
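For example, a minimal sketch assuming the pre-trained file exposes an unknown token (the exact spelling, e.g. unk or &lt;unk&gt;, and the file name vary by model):

```python
import numpy as np
from gensim.models import KeyedVectors

# hypothetical file name; the unknown-token spelling differs between models
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
unk = wv["<unk>"] if "<unk>" in wv.key_to_index else np.zeros(wv.vector_size)

def lookup(word):
    # fall back to the unknown-token vector when no pre-trained vector exists
    return wv[word] if word in wv.key_to_index else unk
```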

answered Oct 18 '22 by qiqi li