Embedding in pytorch

Does Embedding make similar words closer to each other? And do I just need to give it all the sentences? Or is it just a lookup table, so that I need to code the model myself?

asked Jun 07 '18 by user1927468

People also ask

What is embedding in PyTorch?

A PyTorch embedding maps items from a high-dimensional, discrete space (such as word indices over a vocabulary) into a low-dimensional space of dense vectors, so that models can work with the items more easily and the learned representations can be reused on new problems.
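For example, a minimal sketch of that translation (the vocabulary size and dimension below are arbitrary choices):

import torch
from torch import nn

# A hypothetical vocabulary of 10,000 words mapped to 50-dimensional vectors
embedding = nn.Embedding(10000, 50)

# Word indices (a 10,000-dimensional one-hot space) become dense 50-dimensional vectors
vectors = embedding(torch.LongTensor([1, 42, 999]))
print(vectors.shape)  # torch.Size([3, 50])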

What is embedding bag in PyTorch?

nn.EmbeddingBag computes sums or means of 'bags' of embeddings without instantiating the intermediate embeddings; with mode="sum" it is equivalent to nn.Embedding followed by torch.sum(dim=1).
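A small sketch of that equivalence (the sizes and indices are arbitrary):

import torch
from torch import nn

embedding = nn.Embedding(10, 3)
embedding_bag = nn.EmbeddingBag(10, 3, mode="sum")
embedding_bag.weight.data.copy_(embedding.weight.data)  # share the same weights

indices = torch.LongTensor([[1, 2, 4], [4, 3, 2]])  # two "bags" of three words each

# EmbeddingBag sums each bag directly...
bag_out = embedding_bag(indices)
# ...which matches Embedding followed by an explicit sum over the bag dimension
manual_out = embedding(indices).sum(dim=1)

print(torch.allclose(bag_out, manual_out))  # True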

Is PyTorch embedding trainable?

nn.Embedding is a model parameter layer and is trainable by default. If you fine-tune word vectors during training, they are treated as model parameters and updated by backpropagation. You can also make the layer untrainable by freezing its weights, i.e. disabling their gradient.
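A minimal sketch of freezing the layer (the sizes here are arbitrary):

import torch
from torch import nn

embedding = nn.Embedding(1000, 128)
print(embedding.weight.requires_grad)  # True: the vectors are updated by backpropagation

# Freeze the layer so its vectors are no longer updated during training
embedding.weight.requires_grad_(False)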


2 Answers

nn.Embedding holds a Tensor of dimension (vocab_size, vector_size), i.e. the size of the vocabulary times the dimension of each vector embedding, plus a method that does the lookup.

When you create an embedding layer, the Tensor is initialised randomly. It is only when you train it that this similarity between similar words should appear, unless you have overwritten the values of the embedding with a previously trained model such as GloVe or Word2Vec, but that's another story.
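If you do want to start from pretrained vectors, a minimal sketch could look like the following, where glove_weights is a stand-in for a real (vocab_size, vector_size) matrix of GloVe or Word2Vec vectors:

import torch
from torch import nn

# glove_weights stands in for a real matrix of pretrained word vectors
glove_weights = torch.randn(1000, 128)

# Build the embedding layer directly from the pretrained matrix
# (freeze=False keeps the vectors trainable; the default is to freeze them)
embedding = nn.Embedding.from_pretrained(glove_weights, freeze=False)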

So, once you have the embedding layer defined and the vocabulary defined and encoded (i.e. a unique number assigned to each word in the vocabulary), you can use the instance of the nn.Embedding class to get the corresponding embeddings.

For example:

import torch
from torch import nn

embedding = nn.Embedding(1000, 128)   # vocabulary of 1000 words, 128-dimensional vectors
embedding(torch.LongTensor([3, 4]))   # look up the embeddings for words 3 and 4

will return the embedding vectors corresponding to words 3 and 4 in your vocabulary. As no model has been trained, they will be random.
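To see that the vectors only move once you train, here is a purely illustrative sketch in which the embedding feeds a small linear classifier and is updated by backpropagation:

import torch
from torch import nn

# Purely illustrative setup: the embedding feeds a small linear classifier
embedding = nn.Embedding(1000, 128)
classifier = nn.Linear(128, 2)
optimizer = torch.optim.SGD(list(embedding.parameters()) + list(classifier.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

words = torch.LongTensor([3, 4])    # dummy word indices
labels = torch.LongTensor([0, 1])   # dummy labels

before = embedding.weight[3].detach().clone()

loss = loss_fn(classifier(embedding(words)), labels)
loss.backward()
optimizer.step()

print(torch.equal(before, embedding.weight[3]))  # False: training has moved the vector for word 3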

answered Oct 17 '22 by Escachator


You could treat nn.Embedding as a lookup table where the key is the word index and the value is the corresponding word vector. However, before using it you should specify the size of the lookup table, and initialize the word vectors yourself. Following is a code example demonstrating this.

import torch
import torch.nn as nn

# vocab_size is the number of words in your train, val and test set
# vector_size is the dimension of the word vectors you are using
embed = nn.Embedding(vocab_size, vector_size)

# initialize the word vectors; pretrained_weights is a
# numpy array of size (vocab_size, vector_size) and
# pretrained_weights[i] retrieves the word vector of the
# i-th word in the vocabulary
embed.weight.data.copy_(torch.from_numpy(pretrained_weights))

# Then turn the word indices into actual word vectors
vocab = {"some": 0, "words": 1}
word_indexes = [vocab[w] for w in ["some", "words"]]
word_vectors = embed(torch.LongTensor(word_indexes))
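The resulting word_vectors tensor has shape (2, vector_size), one row per looked-up word. PyTorch also provides nn.Embedding.from_pretrained(torch.from_numpy(pretrained_weights)), which builds the same layer from the pretrained weights in a single call.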
answered Oct 17 '22 by AveryLiu