
Gensim word2vec in python3 missing vocab

I'm using the gensim implementation of Word2Vec. I have the following code snippet:

print('training model')
model = Word2Vec(Sentences(start, end))
print('trained model:', model)
print('vocab:', model.vocab.keys())

When I run this in Python 2, it runs as expected. The final print shows all the words in the vocabulary.

However, if I run it in Python 3, I get an error:

trained model: Word2Vec(vocab=102, size=100, alpha=0.025)
Traceback (most recent call last):
  File "learn.py", line 58, in <module>
    train(to_datetime('-4h'), to_datetime('now'), 'model.out')
  File "learn.py", line 23, in train
    print('vocab:', model.vocab.keys())
AttributeError: 'Word2Vec' object has no attribute 'vocab'

What is going on? Is gensim's Word2Vec not compatible with Python 3?

Sam Lee asked Feb 28 '17 19:02

1 Answer

Are you using the same version of gensim in both places? Gensim 1.0.0 moves vocab to a helper object, so whereas in pre-1.0.0 versions of gensim (in Python 2 or 3), you can use:

model.vocab

...in gensim 1.0.0+ you should instead use (in Python 2 or 3)...

model.wv.vocab

gojomo answered Oct 04 '22 01:10
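
If the same script has to run against more than one gensim version, the lookup can be wrapped in a small helper. The following is only a sketch: `vocab_words` is a hypothetical name, not part of gensim, and the `key_to_index` branch assumes the gensim 4.0+ layout, where the vocabulary lives in `model.wv.key_to_index`. The stand-in objects at the bottom are fakes used to exercise each branch without a trained model.

```python
from types import SimpleNamespace


def vocab_words(model):
    """Return the vocabulary words of a Word2Vec-like model as a list.

    Hypothetical helper (not part of gensim). It probes the attribute
    layouts used by different gensim releases, newest first.
    """
    wv = getattr(model, "wv", None)
    if wv is not None:
        # Assumed gensim 4.0+ layout: dict mapping word -> index.
        if hasattr(wv, "key_to_index"):
            return list(wv.key_to_index)
        # gensim 1.0.0 up to 3.x: dict on the helper object.
        if hasattr(wv, "vocab"):
            return list(wv.vocab)
    # Pre-1.0.0: dict directly on the model.
    if hasattr(model, "vocab"):
        return list(model.vocab)
    raise AttributeError("could not find a vocabulary on this model")


# Stand-in objects (not real gensim models) mimicking each layout.
pre_1_0 = SimpleNamespace(vocab={"cat": object(), "dog": object()})
v1_to_v3 = SimpleNamespace(wv=SimpleNamespace(vocab={"cat": object(), "dog": object()}))
v4_plus = SimpleNamespace(wv=SimpleNamespace(key_to_index={"cat": 0, "dog": 1}))

for fake_model in (pre_1_0, v1_to_v3, v4_plus):
    print(sorted(vocab_words(fake_model)))
```

With a real model, the call in the question would become `print('vocab:', vocab_words(model))`. Comparing `gensim.__version__` in the Python 2 and Python 3 environments is still the quickest way to confirm that a version mismatch is the cause.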