Loading the complete pre-trained word2vec model by Google is time intensive and tedious, therefore I was wondering if there is a chance to remove words below a certain frequency to bring the vocab
count down to e.g. 200k words.
I found Word2Vec methods in the gensim
package to determine the word frequency and to re-save the model again, but I am not sure how to pop
/remove
vocab from the pre-trained model before saving it again. I couldn't find any hint in the KeyedVector class
and the Word2Vec class
for such an operation?
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py
How can I select a subset of the vocabulary of the pre-trained word2vec model?
Word2Vec (short for word to vector) was a technique invented by Google in 2013 for embedding words. It takes as input a word and spits out an n-dimensional coordinate (or “vector”) so that when you plot these word vectors in space, synonyms cluster.
The word2vec algorithms include skip-gram and CBOW models, using either hierarchical softmax or negative sampling: Tomas Mikolov et al: Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov et al: Distributed Representations of Words and Phrases and their Compositionality.
The pre-trained Google word2vec model was trained on Google news data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors. It is a 1.53 Gigabytes file. You can download it from here: GoogleNews-vectors-negative300.
Word2Vec is a widely used word representation technique that uses neural networks under the hood. The resulting word representation or embeddings can be used to infer semantic similarity between words and phrases, expand queries, surface related concepts and more.
The GoogleNews word-vectors file format doesn't include frequency info. But, it does seem to be sorted in roughly more-frequent to less-frequent order.
And, load_word2vec_format()
offers an optional limit
parameter that only reads that many vectors from the given file.
So, the following should do roughly what you've requested:
goognews_wordecs = KeyedVectors.load_word2vec_format(`GoogleNews-vectors-negative300.bin.gz`, binary=True, limit=200000)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With