I have a set of pre-trained word2vec word vectors and a corpus. I want to use the word vectors to represent words in the corpus. The corpus has some words in it that I don't have trained word vectors for. What's the best way to handle those words for which there is no pre-trained vector?
I've heard several suggestions:

- use a vector of zeros for every missing word
- use a vector of random numbers for every missing word (with a bunch of suggestions on how to bound those randoms)
- an idea I had: use a vector whose values are the mean of all values at that position across all pre-trained vectors (all three fallbacks are sketched in code below)
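For concreteness, here is a minimal sketch of how those three fallbacks could be wired up, assuming gensim 4.x; the vectors file name is just a placeholder for whatever pre-trained embeddings you have:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file name -- substitute your own pre-trained word2vec vectors.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
dim = kv.vector_size
rng = np.random.default_rng(0)

def vector_for(word, strategy="mean"):
    """Return a vector for `word`, falling back to `strategy` when it is out of vocabulary."""
    if word in kv.key_to_index:                # in vocabulary: use the pre-trained vector
        return kv[word]
    if strategy == "zeros":                    # option 1: all-zero vector
        return np.zeros(dim, dtype=np.float32)
    if strategy == "random":                   # option 2: small random values (the bound is an arbitrary choice)
        return rng.uniform(-0.25, 0.25, dim).astype(np.float32)
    return kv.vectors.mean(axis=0)             # option 3: per-position mean of all pre-trained vectors
```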
Does anyone with experience with this problem have thoughts on how to handle it?
FastText from Facebook assembles word vectors from subword n-grams, which allows it to handle out-of-vocabulary words. See more about this approach at: Out of Vocab Word Embedding
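A minimal sketch of this with gensim (the model file name is just a placeholder for a fastText .bin model you have downloaded, e.g. from fasttext.cc):

```python
from gensim.models.fasttext import load_facebook_vectors

# Hypothetical file name -- any fastText .bin model works here.
ft = load_facebook_vectors("cc.en.300.bin")

# Even a word that never appeared in training gets a vector,
# assembled from the character n-grams it contains.
oov_vector = ft["word2vecish"]
```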
In a pre-trained word2vec embedding matrix, there is often a special unknown-word token such as unk; you can use its index to look up a vector that was designed for exactly this case, and that is often the best choice for missing words.
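For example, with gensim's KeyedVectors you might look for such a token like this; the exact spelling (unk, <unk>, UNK) depends on how the embeddings were trained, and not every model ships one, so this is only a sketch:

```python
from gensim.models import KeyedVectors

# Hypothetical file name -- use your own pre-trained embeddings.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Try the common spellings of the unknown-word token.
for unk_token in ("unk", "<unk>", "UNK"):
    if unk_token in kv.key_to_index:
        unk_vector = kv[unk_token]
        break
else:
    unk_vector = None  # this particular model has no unknown-word vector
```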