I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.
The problem is that the embedding is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer.
Is there a way to keep only the desired word vectors (including the corresponding indices!), based on a whitelist of words?
Thanks to this answer (I've changed the code a little bit to make it better), you can use the following code to solve your problem.
Assume the reduced set of words is in restricted_word_set (it can be either a list or a set) and w2v is our model; here is the function:
import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    # Walk the vocabulary in index order and keep only whitelisted words.
    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)  # re-index into the reduced vocab
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    # Overwrite the model's word-related attributes with the reduced versions.
    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)
WARNING: when you first create the model, vectors_norm == None, so you will get an error if you use this function right away. vectors_norm only gets a value of type numpy.ndarray after the first use, so before calling the function run something like most_similar("cat") so that vectors_norm is not equal to None.
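For instance, a minimal sketch of the setup (assuming gensim 3.x, where init_sims() precomputes the normalized vectors just like a similarity query would):

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.init_sims()  # populates vectors_norm; w2v.most_similar("cat") would also do it
restrict_w2v(w2v, {"beer", "wine"})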
It rewrites all of the word-related attributes of the Word2VecKeyedVectors based on the new vocabulary.
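Note that this relies on gensim 3.x attribute names (vocab, index2entity, vectors_norm). In gensim 4.x those were replaced by key_to_index and index_to_key, so a rough equivalent would look like the following (a hedged sketch, not tested against every 4.x release):

def restrict_w2v_gensim4(kv, restricted_word_set):
    # Indices of the whitelisted words, preserving their original order.
    keep = [i for i, word in enumerate(kv.index_to_key) if word in restricted_word_set]
    kv.index_to_key = [kv.index_to_key[i] for i in keep]
    kv.key_to_index = {word: i for i, word in enumerate(kv.index_to_key)}
    kv.vectors = kv.vectors[keep]
    kv.norms = None  # drop cached norms so gensim recomputes them on demand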
Usage:
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")
[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]
restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")
[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]
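Since the point is to shrink the model, you can also save the restricted vectors so the full file never has to be loaded again (the filename here is just an example):

w2v.save_word2vec_format("restricted_vectors.bin", binary=True)
# later: w2v = KeyedVectors.load_word2vec_format("restricted_vectors.bin", binary=True)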
The same approach can also be used to remove specific words instead, by whitelisting everything except those words.
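Back to the original goal: once the vectors are restricted, they can be plugged into a Keras Embedding layer. A minimal sketch, assuming tf.keras and that your input sequences are encoded with the same indices as w2v.index2word:

import numpy as np
from tensorflow.keras.layers import Embedding

embedding_matrix = np.array(w2v.vectors)   # row i is the vector for w2v.index2word[i]

embedding_layer = Embedding(
    input_dim=embedding_matrix.shape[0],   # restricted vocabulary size
    output_dim=embedding_matrix.shape[1],  # e.g. 300 for the GoogleNews vectors
    weights=[embedding_matrix],            # initialize with the pretrained vectors
    trainable=False,                       # keep them frozen during training
)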