I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.
So my question is: how do I get the embedding weights loaded by gensim into the PyTorch embedding layer?
Thanks in advance!
Gensim is a Python library that, most notably here, includes an implementation of the Word2Vec word embedding for learning new word vectors from text. It also provides tools for loading pre-trained word embeddings in a few formats and for using and querying a loaded embedding.
This can mean that for solving semantic NLP tasks, when the training set at hand is sufficiently large (as was the case in the Sentiment Analysis experiments), it is better to use pre-trained word embeddings.
Pretrained Word Embeddings are the embeddings learned in one task that are used for solving another similar task. These embeddings are trained on large datasets, saved, and then used for solving other tasks. That's why pretrained word embeddings are a form of Transfer Learning.
I just wanted to report my findings about loading a gensim embedding with PyTorch.
PyTorch 0.4.0 and newer:
From v0.4.0 there is a new function from_pretrained() which makes loading an embedding very convenient. Here is an example from the documentation.
    import torch
    import torch.nn as nn

    # FloatTensor containing pretrained weights
    weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
    embedding = nn.Embedding.from_pretrained(weight)
    # Get embeddings for index 1
    input = torch.LongTensor([1])
    embedding(input)
The weights from gensim can easily be obtained by:
    import gensim
    import torch

    model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
    weights = torch.FloatTensor(model.vectors)  # formerly syn0, which is soon deprecated
As noted by @Guglie: in newer gensim versions, if model is a full Word2Vec model rather than a KeyedVectors object, the vectors are found under model.wv:
    weights = torch.FloatTensor(model.wv.vectors)
PyTorch 0.3.1 and older:
I'm using version 0.3.1 and from_pretrained() isn't available in this version.
Therefore I created my own from_pretrained so I can also use it with 0.3.1.
Code for from_pretrained for PyTorch versions 0.3.1 or lower:
    import torch

    def from_pretrained(embeddings, freeze=True):
        assert embeddings.dim() == 2, \
            'Embeddings parameter is expected to be 2-dimensional'
        rows, cols = embeddings.shape
        embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
        embedding.weight = torch.nn.Parameter(embeddings)
        embedding.weight.requires_grad = not freeze
        return embedding
The embedding can then be loaded like this:
    embedding = from_pretrained(weights)
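The freeze flag mirrors the parameter of the same name on the built-in nn.Embedding.from_pretrained. Here is a short sketch of its effect, using the built-in version (so it assumes PyTorch 0.4.0 or newer) and the toy weights from the documentation example:

```python
import torch
import torch.nn as nn

weights = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])

# Default (freeze=True): the pretrained vectors are not updated during training
frozen = nn.Embedding.from_pretrained(weights)
# freeze=False: the vectors are fine-tuned together with the rest of the model
trainable = nn.Embedding.from_pretrained(weights, freeze=False)

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True
```

Keeping the vectors frozen is the safer default when the downstream dataset is small; whether fine-tuning helps depends on the task.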
I hope this is helpful for someone.
I think it is easy: just copy the embedding weights from gensim into the weight of the PyTorch embedding layer.
You need to make sure of two things. First, the weight shape has to match the layer, i.e. (vocabulary size, embedding dimension). Second, the weights have to be converted to a PyTorch FloatTensor.
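The two checks could look roughly like this. It is only a sketch: a random NumPy array is a hypothetical stand-in for the matrix gensim would provide (for example model.vectors), and the layer sizes are made up to match it:

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for gensim's weight matrix (a float32 NumPy array)
vectors = np.random.rand(4, 3).astype(np.float32)

layer = nn.Embedding(num_embeddings=4, embedding_dim=3)

# Check 1: shapes must match (vocabulary size, embedding dimension)
assert tuple(layer.weight.shape) == vectors.shape

# Check 2: convert to a float tensor, then copy without tracking gradients
with torch.no_grad():
    layer.weight.copy_(torch.from_numpy(vectors).float())
```

Unlike from_pretrained, this leaves the layer trainable, since a freshly created nn.Embedding has requires_grad set to True on its weight.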