What is the best way to handle missing words when using word embeddings?

I have a set of pre-trained word2vec word vectors and a corpus. I want to use the word vectors to represent words in the corpus. The corpus has some words in it that I don't have trained word vectors for. What's the best way to handle those words for which there is no pre-trained vector?

I've heard several suggestions.

  1. use a vector of zeros for every missing word

  2. use a vector of random numbers for every missing word (with various suggestions on how to bound those random values)

  3. an idea I had: use a vector where each position holds the mean of that position across all the pre-trained vectors

Does anyone with experience with this problem have thoughts on how to handle it?
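To make the options concrete, here is a minimal sketch of the three fallbacks, assuming the pre-trained vectors are loaded with gensim's KeyedVectors (the file name is just a placeholder):

```python
import numpy as np
from gensim.models import KeyedVectors

# hypothetical file name; any word2vec-format vectors load the same way
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.bin", binary=True)
dim = wv.vector_size

# option 3: position-wise mean over every pre-trained vector
mean_vector = wv.vectors.mean(axis=0)

def embed(word, strategy="mean"):
    if word in wv.key_to_index:
        return wv[word]                             # word has a pre-trained vector
    if strategy == "zeros":                         # option 1
        return np.zeros(dim)
    if strategy == "random":                        # option 2, bounded small values
        return np.random.uniform(-0.25, 0.25, dim)
    return mean_vector                              # option 3
```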

asked Feb 09 '18 by TheSneak

2 Answers

FastText from Facebook builds word vectors from subword (character n-gram) embeddings, which allows it to compose vectors for out-of-vocabulary words. See more about this approach at: Out of Vocab Word Embedding
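For illustration, a minimal sketch using gensim's FastText implementation (assuming gensim 4.x; the toy sentences are just placeholders) showing that an out-of-vocabulary word still gets a vector composed from its character n-grams:

```python
from gensim.models import FastText

# toy corpus just for illustration
sentences = [["the", "quick", "brown", "fox"],
             ["jumped", "over", "the", "lazy", "dog"]]
model = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5, epochs=10)

# "foxes" was never seen during training, but FastText composes a vector
# for it from its character n-grams (e.g. "fox", "oxe", "xes")
print("foxes" in model.wv.key_to_index)  # False: out of vocabulary
print(model.wv["foxes"].shape)           # (50,) -- still returns a vector
```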

answered Oct 18 '22 by Adnan S


A pre-trained word2vec embedding matrix usually includes a special unknown token (often spelled unk) whose vector was trained to stand in for out-of-vocabulary words; looking that token up is often the best choice.
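For example, a minimal sketch assuming the pre-trained file exposes an unknown token (the exact spelling, e.g. unk or &lt;unk&gt;, and the file name vary by model):

```python
import numpy as np
from gensim.models import KeyedVectors

# hypothetical file name; the unknown-token spelling differs between models
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
unk = wv["<unk>"] if "<unk>" in wv.key_to_index else np.zeros(wv.vector_size)

def lookup(word):
    # fall back to the unknown-token vector when no pre-trained vector exists
    return wv[word] if word in wv.key_to_index else unk
```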

answered Oct 18 '22 by qiqi li