 

Reduce Google's Word2Vec model with Gensim

Loading the complete pre-trained word2vec model by Google is time-intensive and tedious, so I was wondering whether it is possible to remove words below a certain frequency to bring the vocabulary count down to e.g. 200k words.

I found Word2Vec methods in the gensim package to determine the word frequency and to re-save the model, but I am not sure how to pop/remove vocabulary entries from the pre-trained model before saving it again. I couldn't find any hint of such an operation in either the KeyedVectors class or the Word2Vec class:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py

How can I select a subset of the vocabulary of the pre-trained word2vec model?

asked Feb 25 '17 by neurix


People also ask

Does Google use Word2Vec?

Word2Vec (short for word to vector) was a technique invented by Google in 2013 for embedding words. It takes as input a word and spits out an n-dimensional coordinate (or “vector”) so that when you plot these word vectors in space, synonyms cluster.

Is Gensim Word2Vec CBOW or skip gram?

The word2vec algorithms include the skip-gram and CBOW models, trained with either hierarchical softmax or negative sampling; see Tomas Mikolov et al., "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".

What is Gensim Word2Vec trained on?

The pre-trained Google word2vec model was trained on Google News data (about 100 billion words); it contains 3 million words and phrases, each represented by a 300-dimensional vector. It is a 1.53 GB file, downloadable as GoogleNews-vectors-negative300.
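For context on what such a file actually contains: the word2vec binary format starts with an ASCII header line "<vocab_size> <dim>", followed by one record per word. A simplified, hypothetical sketch of a reader (pure Python, assuming each record is the word, a space, then dim little-endian float32s; real files can have minor variations that gensim handles for you):

```python
import io
import struct

def read_word2vec_header(stream):
    """Read the header line of a word2vec binary file: '<vocab_size> <dim>\n'."""
    header = stream.readline().decode("utf-8")
    vocab_size, dim = (int(x) for x in header.split())
    return vocab_size, dim

def read_one_vector(stream, dim):
    """Read a single '<word> <floats>' record from a word2vec binary stream."""
    word_bytes = bytearray()
    while True:
        ch = stream.read(1)
        if ch == b" " or ch == b"":  # word is terminated by a space
            break
        word_bytes.extend(ch)
    word = word_bytes.decode("utf-8")
    vector = struct.unpack(f"<{dim}f", stream.read(4 * dim))
    return word, vector

# Toy in-memory "file" with 2 words and 3-dimensional vectors
buf = io.BytesIO(
    b"2 3\n"
    + b"king " + struct.pack("<3f", 0.1, 0.2, 0.3)
    + b"queen " + struct.pack("<3f", 0.1, 0.25, 0.3)
)
vocab_size, dim = read_word2vec_header(buf)
word, vec = read_one_vector(buf, dim)
```

Because records are laid out sequentially like this, a loader can simply stop after N records, which is exactly what makes the `limit` trick in the answer below cheap.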

What does Gensim Word2Vec do?

Word2Vec is a widely used word representation technique that uses neural networks under the hood. The resulting word representation or embeddings can be used to infer semantic similarity between words and phrases, expand queries, surface related concepts and more.
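The "semantic similarity" mentioned above is typically measured as the cosine similarity between word vectors (this is what gensim's similarity methods compute). A minimal pure-Python illustration with made-up 3-dimensional toy embeddings (real GoogleNews vectors are 300-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy embeddings for illustration only
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(vectors["king"], vectors["queen"])  # close to 1.0
sim_king_apple = cosine_similarity(vectors["king"], vectors["apple"])  # much lower
```

Words used in similar contexts end up pointing in similar directions, so related words score near 1.0 and unrelated words score much lower.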


1 Answer

The GoogleNews word-vectors file format doesn't include frequency info. But, it does seem to be sorted in roughly more-frequent to less-frequent order.

And, load_word2vec_format() offers an optional limit parameter that reads only that many vectors from the front of the given file.

So, the following should do roughly what you've requested:

from gensim.models import KeyedVectors

goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)
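If you also want to persist the trimmed model so that later loads are fast, the KeyedVectors loaded with limit can be written back out with save_word2vec_format(). For vectors stored in the plain-text word2vec format, the truncation is just file handling, since the first vectors in the file are kept as-is and only the header's vocabulary count changes. A hypothetical sketch (pure Python, no gensim needed; assumes the text format of a "<vocab_size> <dim>" header followed by one "<word> <floats>" line per vector):

```python
import os
import tempfile

def truncate_word2vec_text(src_path, dst_path, limit):
    """Keep only the first `limit` vectors of a word2vec text-format file."""
    with open(src_path, encoding="utf-8") as src:
        vocab_size, dim = (int(x) for x in src.readline().split())
        keep = min(limit, vocab_size)
        kept_lines = [src.readline() for _ in range(keep)]
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{keep} {dim}\n")   # header must reflect the new vocab size
        dst.writelines(kept_lines)

# Demo with a toy 3-word, 2-dimensional file
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "full.txt")
dst = os.path.join(tmpdir, "trimmed.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("3 2\nking 0.1 0.2\nqueen 0.3 0.4\napple 0.5 0.6\n")
truncate_word2vec_text(src, dst, limit=2)
with open(dst, encoding="utf-8") as f:
    trimmed = f.read()
```

Because the GoogleNews file is sorted roughly most-frequent-first, keeping the first N vectors approximates keeping the N most frequent words, which is what the question asked for.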
answered Oct 20 '22 by gojomo