 

How to remove a word completely from a Word2Vec model in gensim?

Given a model, e.g.

from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]

texts = [d.lower().split() for d in documents]

w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

It's possible to remove the word from the w2v vocabulary, e.g.

# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433  0.08862179  0.08601206  0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"

But when we run a similarity query on other words after deleting graph, the deleted word still pops up in the results, e.g.

>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
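
The reason, verified with a quick sketch below, is that deleting the vocab entry only removes the dictionary lookup; the embedding matrix that most_similar ranks against still contains a row for graph.

>>> 'graph' in w2v_model.wv.vocab
False
>>> w2v_model.wv.vectors.shape  # row count unchanged; the vector for 'graph' is still there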

How to remove a word completely from a Word2Vec model in gensim?


Updated

To answer @vumaasha's comment:

could you give some details as to why you want to delete a word

  • Let's say my universe of words is all the words in the corpus, so that the dense relations between all words are learned.

  • But when I want to generate the similar words, they should only come from a subset of domain-specific words.

  • It's possible to over-generate with .most_similar() and then filter the words (see the sketch after this list), but if the space of the specific domain is small, I might be looking for a word that's ranked 1000th most similar, which is inefficient.

  • It would be better if the word were totally removed from the word vectors; then .most_similar() would not return words outside of the specific domain.
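
A minimal sketch of that over-generate-and-filter workaround (the domain_words set is a made-up example, not from the question):

domain_words = {'graph', 'trees', 'binary'}  # made-up domain subset

# Over-generate candidates, then keep only the in-domain ones.
candidates = w2v_model.most_similar('binary', topn=100)
filtered = [(word, score) for word, score in candidates if word in domain_words]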

asked Feb 23 '18 by alvas


2 Answers

I wrote a function that removes words from a KeyedVectors object if they aren't in a predefined word list.

import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    """Keep only the words in restricted_word_set (gensim 3.x KeyedVectors)."""
    w2v.init_sims()  # make sure vectors_norm is populated before indexing into it

    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)  # re-point the word at its new row
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = np.array(new_vectors_norm)

It rewrites all of the word-related attributes of the Word2VecKeyedVectors object (vocab, vectors, index2entity, index2word, vectors_norm).

Usage:

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]
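
The attribute names above come from the gensim 3.x API. On gensim 4.x, where .vocab, .index2entity, and .vectors_norm were replaced by .key_to_index and .index_to_key, a rough equivalent might look like this sketch (an untested adaptation that ignores per-word metadata):

import numpy as np

def restrict_w2v_gensim4(kv, restricted_word_set):
    # Hedged sketch for the gensim 4.x KeyedVectors API.
    keep = [i for i, word in enumerate(kv.index_to_key) if word in restricted_word_set]
    kv.vectors = kv.vectors[keep]
    kv.index_to_key = [kv.index_to_key[i] for i in keep]
    kv.key_to_index = {word: i for i, word in enumerate(kv.index_to_key)}
    kv.norms = None  # drop cached norms so they are recomputed on the next query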

answered Sep 23 '22 by zsozso


There is no direct way to do what you are looking for. However, you are not completely lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors; you can take a look at that method and modify it to suit your needs.

The lines shown below perform the actual logic of computing the similar words; you need to replace the variable limited with the vectors corresponding to the words of your interest, and then you are done:

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)

Update:

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]

If you look at this line: when restrict_vocab is used, it restricts limited to the top n words in the vocab, which is meaningful only if you have sorted the vocab by frequency. If you do not pass restrict_vocab, self.vectors_norm is what goes into limited.

The method most_similar calls another method, init_sims, which initializes the value of self.vectors_norm as shown below:

self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

So you can pick up the words that you are interested in, prepare their norms, and use the result in place of limited. This should work.
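
For illustration, here is a minimal sketch of that idea against the gensim 3.x API (domain_words is a made-up example set, not from the answer):

import numpy as np

w2v.init_sims()  # make sure the unit-normalised vectors_norm matrix exists

# Build `limited` from just the domain words, using the same norms as init_sims.
domain_words = ['wine', 'lagers', 'bash']  # made-up domain subset
limited = np.array([w2v.vectors_norm[w2v.vocab[w].index] for w in domain_words])

# Rank the domain words against a unit-normalised query vector,
# mirroring the dot product inside most_similar.
query = w2v.word_vec('beer', use_norm=True)
dists = limited.dot(query)
best = sorted(zip(domain_words, dists), key=lambda pair: -pair[1])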

answered Sep 25 '22 by vumaasha