Add word embedding to word2vec gensim model

I'm looking for a way to dynamically add pre-trained word vectors to a word2vec gensim model.

I have a pre-trained word2vec model in a .txt file (words and their embeddings), and I need to compute the Word Mover's Distance (for example via gensim.models.Word2Vec.wmdistance) between the documents in a specific corpus and a new document.

To avoid loading the whole vocabulary, I want to load only the subset of the pre-trained model's words that appear in the corpus. If the new document contains words that are not in the corpus but are in the original model's vocabulary, I want to add them to the model so that they are taken into account in the computation.

My goal is to save RAM, so any of the following would help:

Is there a way to add word vectors directly to the model?
Is there a way to load vectors into gensim from a matrix or another object? I could keep that object in RAM and append the new words to it before loading it into the model.
It doesn't have to be gensim, so if you know a different WMD implementation that takes the vectors as input, that would work too (though I do need it in Python).

Thanks in advance.

asked Apr 24 '17 21:04

eardil

1 Answer

METHOD 1:

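You can use the keyed-vectors classes from gensim.models.keyedvectors, which let you build a vector store and add words to it without training a model. The class and method names below are from gensim 3.x.

from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors  # gensim 3.x; gensim 4.x uses KeyedVectors instead

w2v = WordEmbeddingsKeyedVectors(50)  # 50 = vector length
w2v.add(new_words, their_new_vecs)  # list of words and matching array of vectors; renamed add_vectors() in gensim 4.x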

METHOD 2:

If you have already built a model with gensim.models.Word2Vec, you can assign the vector directly. Suppose I want to add the token <UNK> with a random vector:

model.wv["<UNK>"] = np.random.rand(100)  # 100 is the vector length

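A complete example would look like this:

import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load("text8")  # load the text8 dataset as an iterable
model = Word2Vec(dataset)  # the default vector length is 100

# add the new token with a random vector of the model's vector length
model.wv["<UNK>"] = np.random.rand(model.vector_size)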
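To decide which words and vectors to add in the first place, the pre-trained .txt file can be scanned line by line so that only the requested words are ever held in memory. The following is a minimal sketch of that idea, not part of gensim: it assumes the usual word2vec text layout (an optional "vocab_size vector_size" header, then one "word v1 v2 ..." line per word, separated by single spaces), and the helper name load_subset and the sample data are made up for illustration.

```python
import io


def load_subset(lines, wanted):
    """Scan word2vec-style text lines once and return {word: [floats]}
    for the words in `wanted` only. (Hypothetical helper, not a gensim API.)"""
    wanted = set(wanted)
    found = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Assumption: a first line with exactly two fields is the
        # "<vocab_size> <vector_size>" header, so it is skipped.
        if i == 0 and len(parts) == 2:
            continue
        word = parts[0]
        if word in wanted:
            found[word] = [float(x) for x in parts[1:]]
            if len(found) == len(wanted):
                break  # every requested word has been found
    return found


# Tiny stand-in for the pre-trained .txt file (made-up data).
pretrained_txt = (
    "4 3\n"
    "cat 0.1 0.2 0.3\n"
    "dog 0.4 0.5 0.6\n"
    "fish 0.7 0.8 0.9\n"
    "bird 1.0 1.1 1.2\n"
)

# Step 1: load only the words that occur in the corpus.
corpus_vocab = {"cat", "dog"}
vectors = load_subset(io.StringIO(pretrained_txt), corpus_vocab)

# Step 2: a new document arrives; fetch only the words not loaded yet.
new_doc = ["dog", "fish", "unicorn"]
missing = [w for w in new_doc if w not in vectors]
vectors.update(load_subset(io.StringIO(pretrained_txt), missing))

print(sorted(vectors))  # words now held in memory
```

With a real file, io.StringIO(pretrained_txt) would be replaced by an open file handle. The collected words and vectors can then be handed to the model with either of the two methods above. Words that are not in the pre-trained file (like "unicorn" here) are simply left out, at the cost of one full scan of the file.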

answered Sep 24 '22 01:09

Peyman

Tags:

python

nlp

word2vec