Gensim (word2vec) retrieve n most frequent words

Tags:

gensim

How can I retrieve the n most frequent words from a Gensim word2vec model? As I understand it, frequency and count are not the same, so I can't use the object.count() method.

I need to produce a list of the n most frequent words from my word2vec model.

Edit:

I've tried the following:

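# Map each word to the count stored in its vocab entry (Gensim < 4.0 API)
w2c = {word: entry.count for word, entry in model.wv.vocab.items()}
# Sort by count, highest first; dicts keep insertion order in Python 3.7+
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
w2cSortedList = list(w2cSorted.keys())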

My initial guess was to use the code above, but it relies on the count attribute, and I'm not sure whether that represents the most frequent words.

asked Dec 04 '18 21:12

Phils19

1 Answer

The .count property of each vocab entry is the count of that word as seen during the initial vocabulary survey. So sorting by that and taking the highest-count words will give you the most frequent words.
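
A minimal sketch of that sort-by-count approach, for illustration only. The helper name top_n_by_count is hypothetical and not part of Gensim. It assumes wv is the model.wv object, that Gensim 4.0+ exposes the per-word count through get_vecattr(word, "count") together with the key_to_index mapping, and that older versions expose it as wv.vocab[word].count:

```python
import heapq


def top_n_by_count(wv, n):
    """Return the n highest-count words as (word, count) pairs.

    `wv` is assumed to be a KeyedVectors-like object such as model.wv.
    This helper is a hypothetical illustration, not a Gensim function.
    """
    if hasattr(wv, "get_vecattr"):
        # Gensim 4.0+ style: counts are stored as a per-key attribute.
        pairs = ((word, wv.get_vecattr(word, "count")) for word in wv.key_to_index)
    else:
        # Pre-4.0 style: each vocab entry carries a .count attribute.
        pairs = ((word, entry.count) for word, entry in wv.vocab.items())
    # nlargest avoids sorting the whole vocabulary when n is small.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])
```

Calling top_n_by_count(model.wv, 100) would then give the 100 highest-count words together with their counts, assuming the model was trained in-process so that counts were recorded.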

But also, for efficiency, it's typical practice for the list of known words to be ordered from most to least frequent. You can view this list at model.wv.index_to_key, so you can retrieve the 100 most frequent words with model.wv.index_to_key[:100]. (In Gensim before version 4.0, this same list was called either index2entity or index2word.)
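
As a rough sketch, the slice can be wrapped so that it works across Gensim versions. The helper name most_frequent_words is hypothetical. It assumes wv is model.wv and that the list is frequency-ordered, which is the default for a model trained in-process but may not hold if vocabulary sorting was turned off or the vectors were loaded from a file with a different order:

```python
def most_frequent_words(wv, n):
    """Return the first n words from the model's ordered word list.

    Tries the Gensim 4.0+ attribute name first, then the pre-4.0 names.
    This helper is a hypothetical illustration, not a Gensim function.
    """
    for attr in ("index_to_key", "index2word", "index2entity"):
        words = getattr(wv, attr, None)
        if words is not None:
            # Copy the slice so callers cannot mutate the model's own list.
            return list(words[:n])
    raise AttributeError("no ordered word list found on this object")
```

For example, most_frequent_words(model.wv, 100) corresponds to model.wv.index_to_key[:100] on Gensim 4.0 or later.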

answered Sep 23 '22 06:09

gojomo
