
Gensim (word2vec) retrieve n most frequent words

Tags:

gensim

How is it possible to retrieve the n most frequent words from a Gensim word2vec model? As I understand, the frequency and count are not the same, and I therefore can't use the object.count() method.

I need to produce a list of the n most frequent words from my word2vec model.

Edit:

I've tried the following:

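# Works in Gensim before 4.0, where model.wv.vocab exists.
# Map each word to the count recorded when the vocabulary was built.
w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count
# Sort words from highest to lowest count, then keep only the words.
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
w2cSortedList = list(w2cSorted.keys())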

My initial guess was to use the code above, but it relies on the count attribute. I'm not sure whether that represents the most frequent words.

Phils19 asked Dec 04 '18 21:12



1 Answer

The .count property of each vocab entry is the count of that word as seen during the initial vocabulary survey. So sorting by that, and taking the highest-count words, will give you the most frequent words.

But also, for efficiency, it's typical practice for the ordered list of known words to be sorted from most to least frequent. You can see this in the list model.wv.index_to_key, so you can retrieve the 100 most frequent words with model.wv.index_to_key[:100]. (In Gensim before version 4.0, this same list was called either index2entity or index2word.)
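
As a rough sketch of both approaches for Gensim 4.x: the helper names most_frequent_words and most_frequent_with_counts are made up for illustration, and kv is assumed to be a KeyedVectors object such as model.wv from a model whose vocabulary was built by Word2Vec itself. The second helper looks counts up explicitly with get_vecattr(word, "count"), which in Gensim 4.x takes the place of the older model.wv.vocab[word].count.

```python
from typing import List, Tuple


def most_frequent_words(kv, n: int) -> List[str]:
    """Return the first n keys of kv.index_to_key.

    Assumes kv.index_to_key is already ordered from most to least
    frequent, as is typical when Word2Vec builds the vocabulary.
    """
    return list(kv.index_to_key[:n])


def most_frequent_with_counts(kv, n: int) -> List[Tuple[str, int]]:
    """Return the n highest-count (word, count) pairs.

    Does not rely on the ordering of index_to_key: each count is
    looked up with kv.get_vecattr(word, "count") and sorted explicitly.
    """
    counts = [(word, kv.get_vecattr(word, "count")) for word in kv.index_to_key]
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:n]
```

With a trained model, most_frequent_words(model.wv, 100) should give the same words as model.wv.index_to_key[:100], while most_frequent_with_counts(model.wv, 100) also shows the counts behind that ordering. For vectors loaded from a file that does not store real counts, the count values may not reflect true corpus frequencies.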

gojomo answered Sep 23 '22 06:09