How can I get the vectors for words that are not present in the word2vec vocabulary?

I have checked the previous post (link), but it doesn't seem to work for my case:

I have a pre-trained word2vec model:

from gensim.models import Word2Vec
model = Word2Vec.load('w2v_model')

Now I have a pandas dataframe with keywords:

keyword
corruption
people
budget
cambodia
.......
......

All I want is to add the vector for each keyword into its corresponding columns, but when I use model['cambodia'] it throws an error: KeyError: "word 'cambodia' not in vocabulary"

So I tried to update the model with the keyword:

model.train(['cambodia'])

But this doesn't work for me; when I use model['cambodia']

it still gives the error KeyError: "word 'cambodia' not in vocabulary". How can I add new words to the word2vec vocabulary so I can get their vectors? The expected output is:

keyword    V1         V2          V3         V4            V5         V6   
corruption 0.07397  0.290874    -0.170812   0.085428    -0.148551   0.38846 
people      ..............................................................
budget      ...........................................................
James asked Jul 04 '18

People also ask

How does Word2vec deal with unknown words?

In the case of word2vec, the vocabulary comprises all words in the input corpus, or at least those above the minimum-frequency threshold. The algorithm simply ignores words that are outside its vocabulary. However, there are ways to reframe your problem so that there are essentially no out-of-vocabulary words.
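One common reframing is to make out-of-vocabulary words impossible by design: before training, replace every token below the frequency threshold with a shared placeholder such as "<unk>", which then gets its own vector that any unseen word can fall back to. A minimal sketch (the helper name, threshold, and corpus are illustrative):

```python
from collections import Counter

def replace_rare(sentences, min_count=2, unk="<unk>"):
    # count token frequencies over the whole corpus
    freq = Counter(tok for sent in sentences for tok in sent)
    # map every rare token to the shared placeholder
    return [[tok if freq[tok] >= min_count else unk for tok in sent]
            for sent in sentences]

corpus = [["people", "fight", "corruption"],
          ["people", "discuss", "corruption"]]
print(replace_rare(corpus))
# → [['people', '<unk>', 'corruption'], ['people', '<unk>', 'corruption']]
```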

How many hidden layers are there in a Word2vec word embedding model?

Word embeddings are created using a neural network with one input layer, one hidden layer and one output layer.

How are word vectors created?

There are two common ways through which word vectors are generated: Counts of word/context co-occurrences. Predictions of context given word (skip-gram neural network models, i.e. word2vec)
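The first (count-based) route can be sketched with a tiny co-occurrence counter; each word's row of the resulting counts is a crude word vector (window size and corpus are illustrative):

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            # every neighbour within the window counts as a context
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1
    return counts

corpus = [["cut", "the", "budget"], ["raise", "the", "budget"]]
counts = cooccurrence_counts(corpus)
print(counts[("the", "budget")])  # → 2 (once per sentence)
```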

Is Word2vec bag of words?

Word2vec is not a single algorithm but a combination of two techniques: CBOW (continuous bag of words) and the skip-gram model. Both are shallow neural networks that map word(s) to a target word, and both learn weights that act as the word vector representations.
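The difference between the two techniques shows up in how training pairs are formed: skip-gram predicts each context word from the target word, while CBOW predicts the target word from its combined context. A sketch of the pair generation only (the networks themselves are omitted; function names are illustrative):

```python
def skipgram_pairs(sent, window=1):
    # (input word, context word to predict) pairs
    pairs = []
    for i, target in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pairs.append((target, sent[j]))
    return pairs

def cbow_pairs(sent, window=1):
    # (context word list, target word to predict) pairs
    pairs = []
    for i, target in enumerate(sent):
        context = [sent[j]
                   for j in range(max(0, i - window),
                                  min(len(sent), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

sent = ["people", "fight", "corruption"]
print(skipgram_pairs(sent))
# → [('people', 'fight'), ('fight', 'people'), ('fight', 'corruption'), ('corruption', 'fight')]
print(cbow_pairs(sent))
# → [(['fight'], 'people'), (['people', 'corruption'], 'fight'), (['fight'], 'corruption')]
```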


1 Answer

You can reserve the zero vector [0, 0, ..., 0] for unknown words, and map every word that is not in the vocabulary to it.

id         V1         V2          V3         V4            V5         V6  
0          0          0           0           0           0           0
1       0.07397  0.290874    -0.170812   0.085428    -0.148551   0.38846 
2      ..............................................................
3      ...........................................................

You can use two dicts to solve the problem.

word2id['corruption'] = 1
vec['corruption'] = [0.07397, 0.290874, -0.170812, 0.085428, -0.148551, 0.38846]
...
word2id['cambodia'] = 0
vec['cambodia'] = [0, 0, 0, 0, 0, 0]
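Putting the vector dict to work on the asker's dataframe, a sketch using dict.get with a zero-vector default (vector values copied from the example above; the 6-dimensional size and keyword list are illustrative):

```python
import pandas as pd

dim = 6
vec = {
    "corruption": [0.07397, 0.290874, -0.170812, 0.085428, -0.148551, 0.38846],
    # ... remaining in-vocabulary words ...
}

keywords = ["corruption", "cambodia"]  # 'cambodia' is out of vocabulary
# fall back to the zero vector for any keyword missing from vec
rows = [vec.get(w, [0.0] * dim) for w in keywords]
df = pd.DataFrame(rows, index=keywords,
                  columns=[f"V{i}" for i in range(1, dim + 1)])
print(df)
```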
Wei Chen answered Oct 20 '22