 

Gensim Word2Vec select minor set of word vectors from pretrained model

I have a large pretrained Word2Vec model in gensim from which I want to use the pretrained word vectors for an embedding layer in my Keras model.

The problem is that the embedding size is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer.

Is there a way to keep only the desired word vectors (including the corresponding indices!), based on a whitelist of words?

getaway22 asked Jun 18 '18 17:06



1 Answer

Thanks to this answer (I've changed the code a little to improve it), you can use the following code to solve your problem.

The words we want to keep are in restricted_word_set (it can be either a list or a set) and w2v is our model. Here is the function:

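import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    """Shrink w2v in place so it keeps only the words in restricted_word_set."""
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    # Walk the vocabulary in index order so kept words retain their relative order.
    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        if word in restricted_word_set:
            vocab = w2v.vocab[word]
            # Renumber the word so its index matches its row in the new arrays.
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(w2v.vectors[i])
            # Needs vectors_norm to be populated; it is None right after loading.
            new_vectors_norm.append(w2v.vectors_norm[i])

    # Overwrite every word-related attribute with the restricted version.
    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)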

WARNING: when the model is first loaded, vectors_norm is None, so calling this function right away raises an error. vectors_norm is filled with a numpy.ndarray on first use, so before calling the function run something like most_similar("cat") (or w2v.init_sims()) so that vectors_norm is no longer None.

The function overwrites, in place, every word-related attribute of the Word2VecKeyedVectors object. These attribute names are those of gensim 3.x; gensim 4 replaced vocab and index2word with key_to_index and index_to_key.
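For context, vectors_norm holds a copy of vectors in which every row has been scaled to unit L2 length, which is why it only exists once a similarity query has triggered that computation. A minimal numpy-only sketch of the normalization follows; the helper name l2_normalize_rows is made up for illustration and is not part of gensim, and the sketch assumes no row is all zeros.

```python
import numpy as np

def l2_normalize_rows(vectors):
    """Return a copy of `vectors` with every row scaled to unit L2 length.

    Hypothetical helper for illustration only; assumes no all-zero rows.
    """
    vectors = np.asarray(vectors, dtype=np.float32)
    # One norm per row, kept as a column so it broadcasts across the row.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

# Toy stand-in for w2v.vectors: two 2-dimensional "word vectors".
toy_vectors = np.array([[3.0, 4.0], [0.0, 2.0]])
toy_norm = l2_normalize_rows(toy_vectors)
```

With unit-length rows, the cosine similarity between two words reduces to a plain dot product of their rows.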

Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]

It can also be used to remove specific words instead, for example by inverting the membership test.
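Since the goal in the question is a smaller Keras embedding layer, another option is to leave the model untouched and copy only the whitelisted vectors into a compact weight matrix, together with a word-to-row mapping. The following is a rough sketch under assumptions, not a tested recipe for any particular gensim version: build_embedding_matrix is a made-up name, and a plain dict stands in for the model so the example runs without gensim (a KeyedVectors object supports the same `word in ...` and `...[word]` lookups).

```python
import numpy as np

def build_embedding_matrix(vectors, whitelist, dim):
    """Collect the vectors of whitelisted words into one compact matrix.

    Hypothetical helper. `vectors` is anything that supports
    `word in vectors` and `vectors[word]`.
    Returns (matrix, word_to_index), where row i of matrix belongs to
    the word whose index is i.
    """
    word_to_index = {}
    rows = []
    for word in sorted(whitelist):      # sorted for a reproducible row order
        if word in vectors:             # skip words the model does not know
            word_to_index[word] = len(rows)
            rows.append(np.asarray(vectors[word], dtype=np.float32))
    if rows:
        matrix = np.vstack(rows)
    else:
        matrix = np.zeros((0, dim), dtype=np.float32)
    return matrix, word_to_index

# Toy stand-in for a pretrained model with 2-dimensional vectors.
toy = {
    "beer": np.array([1.0, 0.0]),
    "wine": np.array([0.0, 1.0]),
    "cat": np.array([1.0, 1.0]),
}
matrix, word_to_index = build_embedding_matrix(toy, {"wine", "beer", "python"}, dim=2)
```

The matrix can then serve as the initial weights of the embedding layer, with word_to_index used to encode the input. If index 0 is reserved for padding (as with mask_zero=True in Keras), the indices need to be shifted by one and a padding row added at the top.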

Peyman answered Sep 25 '22 16:09