Reduce Google's Word2Vec model with Gensim

Tags: nlp, gensim, word2vec

Loading the complete pre-trained word2vec model from Google is slow and tedious, so I was wondering whether it is possible to remove words below a certain frequency and bring the vocabulary down to, e.g., 200k words.

I found Word2Vec methods in the gensim package to determine the word frequency and to re-save the model, but I am not sure how to pop/remove vocabulary entries from the pre-trained model before saving it again. I couldn't find any hint of such an operation in either the KeyedVectors class or the Word2Vec class.

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py

How can I select a subset of the vocabulary of the pre-trained word2vec model?

asked Feb 25 '17 17:02

neurix

1 Answer

The GoogleNews word-vectors file format doesn't include frequency info. But, it does seem to be sorted in roughly more-frequent to less-frequent order.

And load_word2vec_format() offers an optional limit parameter that reads only the first that many vectors from the given file.

So, the following should do roughly what you've requested:

from gensim.models import KeyedVectors
goognews_wordvecs = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True, limit=200000)

answered Oct 20 '22 21:10

gojomo

