
gensim word2vec: Find number of words in vocabulary

After training a word2vec model using python gensim, how do you find the number of words in the model's vocabulary?

hlin117 asked Feb 24 '16 07:02



1 Answer

In recent versions, the model.wv property holds the words and their vectors, and can itself report a length: the number of words it contains. So if w2v_model is your Word2Vec (or Doc2Vec or FastText) model, it's enough to do:

vocab_len = len(w2v_model.wv) 

If your model is just a raw set of word-vectors, like a KeyedVectors instance rather than a full Word2Vec/etc model, it's just:

vocab_len = len(kv_model) 

Other useful internals in Gensim 4.0+ include model.wv.index_to_key, a plain list of the key (word) in each index position, and model.wv.key_to_index, a plain dict mapping keys (words) to their index positions.
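To make the relationship between those two structures concrete, here is a small sketch that uses stand-in data rather than a trained model. The word list is hypothetical; with a real model you would read model.wv.index_to_key and model.wv.key_to_index instead of building them by hand.

```python
# Stand-in for model.wv.index_to_key: position i holds the key stored at index i.
# (Hypothetical contents; a real model fills this from its training corpus.)
index_to_key = ["the", "cat", "sat", "mat"]

# Stand-in for model.wv.key_to_index: the inverse lookup, key -> index position.
key_to_index = {key: idx for idx, key in enumerate(index_to_key)}

# Both structures hold one entry per vocabulary word.
vocab_len = len(index_to_key)

# Round trip: key -> index -> key gives back the original key.
round_trip_ok = all(index_to_key[key_to_index[key]] == key for key in index_to_key)
```

Since the list and the dict describe the same vocabulary, the length of either should match len(model.wv) on a real model.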

In pre-4.0 versions, the vocabulary was in the vocab field of the Word2Vec model's wv property, as a dictionary, with the keys being each token (word). So there it was just the usual Python for getting a dictionary's length:

len(w2v_model.wv.vocab) 

In very old Gensim versions (before 0.13), vocab appeared directly on the model, so back then you would use w2v_model.vocab instead of w2v_model.wv.vocab.
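If the same code has to cope with more than one of these layouts, the lookups above could be folded into one helper. This is only a sketch: vocab_size is a hypothetical name, not part of Gensim, and the objects below are plain stand-ins that mimic the attribute layouts so the snippet runs without Gensim installed. It assumes that asking for vocab on a 4.0+ object fails with AttributeError, which getattr with a default treats as "missing".

```python
from types import SimpleNamespace


def vocab_size(model):
    """Return the vocabulary size for any of the attribute layouts described above.

    Hypothetical helper, not a Gensim API.
    """
    # Full models keep their vectors on .wv; a bare KeyedVectors is used as-is.
    kv = getattr(model, "wv", model)
    # Pre-4.0 layouts expose a dict named vocab (on .wv, or on the model before 0.13).
    old_vocab = getattr(kv, "vocab", None)
    if old_vocab is not None:
        return len(old_vocab)
    # Newer layout: the vectors object reports its own length.
    return len(kv)


class FakeKeyedVectors:
    """Stand-in for a newer-style vectors object that supports len()."""

    def __init__(self, words):
        self.index_to_key = list(words)

    def __len__(self):
        return len(self.index_to_key)


# Stand-ins for the three layouts, plus a bare set of vectors.
new_style = SimpleNamespace(wv=FakeKeyedVectors(["a", "b", "c"]))
old_style = SimpleNamespace(wv=SimpleNamespace(vocab={"a": 1, "b": 2}))
very_old_style = SimpleNamespace(vocab={"a": 1})
bare_vectors = FakeKeyedVectors(["x", "y"])

sizes = [vocab_size(m) for m in (new_style, old_style, very_old_style, bare_vectors)]
```

Checking for vocab before calling len() is deliberate: it avoids relying on older vectors objects supporting len() at all.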

But if you're still using anything from before Gensim 4.0, you should definitely upgrade! There are big memory & performance improvements, and the changes required in calling code are relatively small – some renamings & moves, covered in the 4.0 Migration Notes.

gojomo answered Sep 19 '22 15:09