
Add word embedding to word2vec gensim model

I'm looking for a way to dynamically add pre-trained word vectors to a gensim word2vec model.

I have a pre-trained word2vec model in a txt file (words and their embeddings) and I need to compute the Word Mover's Distance (for example via gensim.models.Word2Vec.wmdistance) between the documents in a specific corpus and a new document.
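For reference, this is roughly the baseline I'm trying to avoid: loading the whole file and computing the distance (a minimal sketch; 'vectors.txt' is a placeholder for my file, and wmdistance needs the pyemd package installed):

import gensim
from gensim.models import KeyedVectors

# load the full pre-trained file in word2vec text format (this is what eats the RAM)
vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)

doc_a = "obama speaks to the media in illinois".split()
doc_b = "the president greets the press in chicago".split()
print(vectors.wmdistance(doc_a, doc_b))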

To avoid loading the whole vocabulary, I would like to load only the subset of the pre-trained model's words that are found in the corpus. But if the new document has words that are not found in the corpus yet are in the original model's vocabulary, I want to add them to the model so they are considered in the computation.

What I want is to save RAM, so any of the following would help me:

  • Is there a way to add word vectors directly to the model?
  • Is there a way to load gensim vectors from a matrix or another object? I could keep that object in RAM and append the new words to it before loading it into the model.
  • It doesn't need to be gensim, so if you know a different WMD implementation that takes the vectors as input, that would work (though I do need it in Python).

Thanks in advance.

asked Apr 24 '17 by eardil



1 Answer

METHOD 1:

You can just use keyed vectors from gensim.models.keyedvectors. They are very easy to use.

from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

w2v = WordEmbeddingsKeyedVectors(50)  # 50 = vector length
w2v.add(new_words, their_new_vecs)    # parallel lists of words and 50-dim vectors

(In gensim 4.x this class was merged into plain KeyedVectors and add was renamed add_vectors.)
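Applied to your case, a sketch of loading only the subset you need and then computing WMD (assuming the gensim 3.x API as above; 'pretrained.txt', corpus, doc_a, and doc_b are placeholders, and the file is assumed to have one "word v1 v2 ... vN" line per word with no header):

import numpy as np
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

corpus_vocab = set(word for doc in corpus for word in doc)  # corpus: list of token lists

words, vecs = [], []
with open('pretrained.txt') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        if parts[0] in corpus_vocab:       # keep only words that appear in the corpus
            words.append(parts[0])
            vecs.append(np.array(parts[1:], dtype=np.float32))

w2v = WordEmbeddingsKeyedVectors(len(vecs[0]))  # infer vector length from the data
w2v.add(words, vecs)
print(w2v.wmdistance(doc_a, doc_b))             # doc_a, doc_b: lists of tokens

Any word of the new document that is missing from the corpus can be added later with another w2v.add call before computing the distance again.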

METHOD 2:

And if you have already built a model using gensim.models.Word2Vec, you can just do this. Suppose I want to add the token <UNK> with a random vector:

model.wv["<UNK>"] = np.random.rand(100)  # 100 = the model's vector length

The complete example would be like this:

import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load("text8")  # load the text8 corpus as an iterable
model = Word2Vec(dataset)    # default vector size is 100

model.wv["<UNK>"] = np.random.rand(100)  # must match the model's vector size
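To sanity-check that the token was added (a minimal follow-up; in gensim 4.x the vocab dict is model.wv.key_to_index rather than model.wv.vocab):

print("<UNK>" in model.wv.vocab)               # True
print(model.wv.most_similar("<UNK>", topn=3))  # neighbours of the random vector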
answered Sep 24 '22 by Peyman