 

How to get vocabulary word count from gensim word2vec?

I am using the gensim word2vec package in Python. I know how to get the vocabulary from the trained model, but how do I get the word count for each word in the vocabulary?

Michelle Owen asked May 12 '16 15:05



2 Answers

Each word in the vocabulary has an associated vocabulary object, which contains an index and a count.

# gensim < 4.0; on a full Word2Vec model use w2v.wv.vocab instead
vocab_obj = w2v.vocab["word"]
vocab_obj.count

Output for the Google News w2v model: 2998437

So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary.

for word, vocab_obj in w2v.vocab.items():
    print(word, vocab_obj.count)  # do something with vocab_obj.count
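
In gensim 4.0 and later the `vocab` dictionary was removed, so the loop above raises an error there. A minimal sketch of the equivalent, assuming a gensim 4.x `KeyedVectors` object (for a trained `Word2Vec` model that is `model.wv`); the helper name `word_counts` is made up for illustration:

```python
def word_counts(kv):
    """Return a {word: count} dict for a gensim 4.x KeyedVectors object.

    Assumes `kv` exposes `index_to_key` (the list of vocabulary words) and
    `get_vecattr(word, "count")`, as KeyedVectors does in gensim >= 4.0.
    """
    return {word: kv.get_vecattr(word, "count") for word in kv.index_to_key}

# Usage (requires gensim >= 4.0 and a trained model; not run here):
# counts = word_counts(model.wv)
# counts["word"]
```

The helper only relies on those two attributes, so it works on any object that provides them.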
user3390629 answered Oct 03 '22 08:10



To build a dictionary that maps each word to its count for easy retrieval later, you can do the following:

# Map each vocabulary word to its count (gensim < 4.0 API)
w2c = {word: vocab_obj.count for word, vocab_obj in model.wv.vocab.items()}

If you want to sort it so that the most frequent words in the model come first, you can do that like so:

# dicts keep insertion order in Python 3.7+, so the result stays sorted by count
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
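
If only the top few words are needed, `collections.Counter` from the standard library can replace the manual sort. A small sketch; the `w2c` values below are invented sample data standing in for the dictionary built above:

```python
from collections import Counter

# Hypothetical sample counts; in practice use the w2c dict built from the model.
w2c = {"the": 1061396, "cat": 1531, "of": 593677, "sat": 842}

# most_common(n) returns the n (word, count) pairs with the highest counts.
top_two = Counter(w2c).most_common(2)
print(top_two)
```

This avoids sorting and rebuilding the whole dictionary when only a handful of entries matter.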
Ahmedov answered Oct 03 '22 10:10
