I have checked the linked previous post, but it doesn't seem to work for my case.
I have a pre-trained word2vec model:
from gensim.models import Word2Vec
model = Word2Vec.load('w2v_model')
Now I have a pandas dataframe with keywords:
keyword
corruption
people
budget
cambodia
.......
......
All I want is to add the vectors for each keyword in its corresponding columns, but
when I use model['cambodia']
it throws an error: KeyError: "word 'cambodia' not in vocabulary"
So I tried to update the model with the keyword:
model.train(['cambodia'])
But this doesn't work for me. When I use
model['cambodia']
it still gives the error KeyError: "word 'cambodia' not in vocabulary"
How can I add new words to the word2vec vocabulary so I can get their vectors? The expected output is:
keyword V1 V2 V3 V4 V5 V6
corruption 0.07397 0.290874 -0.170812 0.085428 -0.148551 0.38846
people ..............................................................
budget ...........................................................
In the case of word2vec, the vocabulary is made up of all words in the input corpus, or at least those above the minimum-frequency threshold. Words outside that vocabulary have no vector, which is why the lookup raises a KeyError. However, there are ways to reframe your problem so that there are essentially no out-of-vocabulary words.
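For the gensim model in the question, one way to bring new words in is to extend the vocabulary first and then continue training. The following is only a minimal sketch: `add_sentences` is a hypothetical helper name, and it assumes the model's `build_vocab` accepts `update=True` and that the model exposes an `epochs` attribute, which may differ between gensim releases. Note that calling `train` alone does not add words, and each sentence has to be a list of tokens rather than a bare string.

```python
def add_sentences(model, new_sentences):
    """Extend a trained Word2Vec-style model with new sentences (sketch)."""
    sentences = []
    for tokens in new_sentences:
        # A bare string would be split into characters, so reject it early.
        if isinstance(tokens, str):
            raise TypeError("each sentence must be a list of tokens, not a string")
        sentences.append(list(tokens))
    # Register unseen words; words rarer than the model's min_count
    # setting may still be left out of the vocabulary.
    model.build_vocab(sentences, update=True)
    # Continue training so the new words receive trained vectors.
    model.train(sentences, total_examples=len(sentences), epochs=model.epochs)
    return model
```

A call might look like `add_sentences(model, [["cambodia", "budget", "people"]])`. A single short sentence gives the new word very little context, so the resulting vector is unlikely to be meaningful without more text.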
Word embeddings are created using a neural network with one input layer, one hidden layer and one output layer.
There are two common ways through which word vectors are generated: counts of word/context co-occurrences, and predictions of context given a word (skip-gram neural network models, i.e. word2vec).
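To make the count-based approach concrete, here is a small sketch that counts how often each word appears near each other word. The function name `cooccurrence_counts`, the window size, and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus built from the keywords in the question.
corpus = [["people", "discuss", "budget"], ["budget", "corruption", "people"]]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("budget", "corruption")])
```

Count-based methods typically go on to reweight and factorize such a table, whereas word2vec learns vectors by prediction instead.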
Word2vec is not a single algorithm but a combination of two techniques: CBOW (continuous bag of words) and the skip-gram model. Both are shallow neural networks that map one or more input words to a target that is also one or more words. Both techniques learn weights which act as word vector representations.
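The difference between the two techniques shows up in how training examples are formed from a sentence. The sketch below only builds the examples and does not train anything; `skipgram_pairs` and `cbow_examples` are hypothetical names, and the window size is an arbitrary choice.

```python
def _context(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i]."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window=1):
    """Skip-gram: each center word predicts each of its context words."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in _context(tokens, i, window)]

def cbow_examples(tokens, window=1):
    """CBOW: the context words together predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = _context(tokens, i, window)
        if context:
            examples.append((context, target))
    return examples

sentence = ["people", "discuss", "budget"]
print(skipgram_pairs(sentence))
print(cbow_examples(sentence))
```

In both cases a word that never occurs in any training sentence produces no examples, so no vector is learned for it.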
You can initialize the first vector as [0, 0, ..., 0], and map any word that is not in the vocabulary to that zero vector.
id V1 V2 V3 V4 V5 V6
0 0 0 0 0 0 0
1 0.07397 0.290874 -0.170812 0.085428 -0.148551 0.38846
2 ..............................................................
3 ...........................................................
You can use two dicts to solve the problem.
word2id['corruption'] = 1
vec['corruption'] = [0.07397, 0.290874, -0.170812, 0.085428, -0.148551, 0.38846]
...
word2id['cambodia'] = 0
vec['cambodia'] = [0, 0, 0, 0, 0, 0]
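The zero-vector fallback can be sketched as follows. This is an illustration under assumptions: `build_lookup` is a hypothetical helper, and a plain dict named `known_vectors` stands in for the trained model's word-to-vector lookup.

```python
def build_lookup(keywords, known_vectors, dim):
    """Map each keyword to a row id; unknown words share row 0 (all zeros)."""
    word2id = {}
    rows = [[0.0] * dim]  # row 0 is the shared zero vector
    for word in keywords:
        if word in word2id:
            continue  # already handled
        if word in known_vectors:
            word2id[word] = len(rows)
            rows.append(list(known_vectors[word]))
        else:
            word2id[word] = 0  # out-of-vocabulary word
    return word2id, rows

# Values for 'corruption' are taken from the question; 'cambodia' is unknown.
known_vectors = {
    "corruption": [0.07397, 0.290874, -0.170812, 0.085428, -0.148551, 0.38846],
}
word2id, rows = build_lookup(["corruption", "cambodia"], known_vectors, dim=6)
print(word2id)
print(rows[word2id["cambodia"]])
```

Each keyword's vector is then `rows[word2id[keyword]]`, which can be spread across the V1 to V6 columns. Keep in mind that all unknown words become indistinguishable from each other under this scheme.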