 

PyTorch / Gensim - How to load pre-trained word embeddings

I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

My question is: how do I get the embedding weights loaded by gensim into a PyTorch embedding layer?

Thanks in advance!

asked Apr 07 '18 by MBT

People also ask

Is Gensim used for word embedding?

The gensim Python library supports an implementation of the Word2Vec word embedding algorithm for learning new word vectors from text. It also provides tools for loading pre-trained word embeddings in a few formats, and for using and querying a loaded embedding.
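For illustration, a minimal sketch of both sides of that: training new vectors on a toy corpus and saving them in a format that can be reloaded later. The parameter names assume gensim 4.x (vector_size was called size in 3.x), and the corpus and filename are made up:

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [['the', 'quick', 'brown', 'fox'],
             ['the', 'lazy', 'dog']]

# vector_size is the gensim 4.x name (it was `size` in 3.x)
model = Word2Vec(sentences, vector_size=50, min_count=1)

# save in the word2vec text format for later loading
model.wv.save_word2vec_format('my_vectors.txt')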

Is using pre-trained Embeddings better than using custom trained Embeddings?

This can mean that for solving semantic NLP tasks, when the training set at hand is sufficiently large, it is better to use pre-trained word embeddings.

What is Pretrained word embeddings?

Pretrained Word Embeddings are the embeddings learned in one task that are used for solving another similar task. These embeddings are trained on large datasets, saved, and then used for solving other tasks. That's why pretrained word embeddings are a form of Transfer Learning.


2 Answers

I just wanted to report my findings about loading a gensim embedding with PyTorch.


  • Solution for PyTorch 0.4.0 and newer:

Since v0.4.0 there is a new function, from_pretrained(), which makes loading an embedding very convenient. Here is an example from the documentation:

import torch
import torch.nn as nn

# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)
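Note that from_pretrained() freezes the weights by default; pass freeze=False if the embedding should stay trainable:

embedding = nn.Embedding.from_pretrained(weight, freeze=False)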

The weights from gensim can easily be obtained by:

import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)  # formerly syn0, which is soon deprecated

As noted by @Guglie: in newer gensim versions, when you have a trained Word2Vec model (rather than KeyedVectors loaded directly), the vectors live on model.wv:

weights = torch.FloatTensor(model.wv.vectors)
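Putting the pieces together, a minimal end-to-end sketch; 'path/to/file' and the word 'king' are placeholders, and the vocabulary mapping is model.key_to_index in gensim 4.x (model.vocab['king'].index in older versions):

import gensim
import torch
import torch.nn as nn

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)
embedding = nn.Embedding.from_pretrained(weights)

# map a word to its row index via gensim's vocabulary, then look it up
idx = torch.LongTensor([model.key_to_index['king']])
vector = embedding(idx)  # shape: (1, embedding_dim)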

  • Solution for PyTorch version 0.3.1 and older:

I'm using version 0.3.1 and from_pretrained() isn't available in this version.

Therefore I created my own from_pretrained so I can also use it with 0.3.1:

def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
        'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding

The embedding can then be loaded just like this:

embedding = from_pretrained(weights) 

I hope this is helpful for someone.

answered Sep 21 '22 by MBT

I think it is easy: just copy the embedding weights from gensim into the corresponding weights of the PyTorch embedding layer.

You need to make sure two things are correct: first, the weight shape has to match; second, the weights have to be converted to the PyTorch FloatTensor type.
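A minimal sketch of that manual copy, assuming model was loaded with gensim's KeyedVectors.load_word2vec_format as in the answer above:

import torch
import torch.nn as nn

vocab_size, emb_dim = model.vectors.shape
embedding = nn.Embedding(vocab_size, emb_dim)

# the gensim matrix must be a FloatTensor with the same shape as embedding.weight
weights = torch.FloatTensor(model.vectors)
assert weights.shape == embedding.weight.shape
embedding.weight.data.copy_(weights)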

answered Sep 20 '22 by jdhao