I am using the gensim word2vec package in Python. I know how to get the vocabulary from the trained model, but how do I get the word count for each word in the vocabulary?
Each word in the vocabulary has an associated vocabulary object, which contains an index and a count.
vocab_obj = w2v.vocab["word"]
vocab_obj.count
Output for the Google News w2v model: 2998437
So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary.
for word, vocab_obj in w2v.vocab.items():
    count = vocab_obj.count  # do something with the count
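Note that the `vocab` mapping used above belongs to gensim versions before 4.0; from 4.0 on it was removed in favour of `key_to_index` and `get_vecattr`. A minimal sketch of a helper that covers both cases follows. The function name `word_counts` and the stand-in object in the usage lines are hypothetical, made up for illustration; only `vocab`, `key_to_index`, and `get_vecattr` are gensim names.

```python
from types import SimpleNamespace


def word_counts(kv):
    """Return {word: count} from a gensim KeyedVectors-like object.

    `kv` is assumed to be `model.wv` or a loaded KeyedVectors instance.
    """
    if hasattr(kv, "get_vecattr"):
        # gensim >= 4.0: the count is stored as a per-key attribute
        return {word: kv.get_vecattr(word, "count") for word in kv.key_to_index}
    # gensim < 4.0: each vocab entry is an object with .count and .index
    return {word: entry.count for word, entry in kv.vocab.items()}


# Hypothetical stand-in for an old-style model.wv, only so the sketch runs
# without a trained model; the counts are invented.
fake_wv = SimpleNamespace(
    vocab={
        "the": SimpleNamespace(count=120, index=0),
        "cat": SimpleNamespace(count=7, index=1),
    }
)
counts = word_counts(fake_wv)
print(counts)
```

With a real model you would call `word_counts(model.wv)` instead of passing the stand-in.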
If you want to create a dictionary mapping each word to its count for easy retrieval later, you can do so as follows:
w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count
If you want to sort it to see the most frequent words in the model, you can do so as follows:
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
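If only the top few words are needed, `collections.Counter` from the standard library saves sorting by hand. A small sketch, assuming `w2c` is a word-to-count dictionary like the one built above; the sample values here are invented for illustration.

```python
from collections import Counter

# Invented sample counts standing in for the w2c dictionary built from the model.
w2c = {"the": 120, "cat": 7, "sat": 3, "on": 45}

# most_common(n) returns the n highest-count (word, count) pairs, largest first.
top_two = Counter(w2c).most_common(2)
print(top_two)  # [('the', 120), ('on', 45)]
```

Calling `most_common()` with no argument returns every pair in descending order of count.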