
Why are Embeddings in PyTorch implemented as Sparse Layers?

Embedding Layers in PyTorch are listed under "Sparse Layers" with the limitation:

Keep in mind that only a limited number of optimizers support sparse gradients: currently it’s optim.SGD (cuda and cpu), and optim.Adagrad (cpu)

What is the reason for this? For example, in Keras I can train an architecture with an Embedding layer using any optimizer.

asked Dec 18 '17 by Imran


People also ask

What does embedding layer do in PyTorch?

The embedding layer works like a lookup table: each word is converted to a number, and these numbers are used to index into the table. Thus, the keys are words and the values are word vectors.
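
For illustration, a minimal sketch of this lookup behaviour (the vocabulary size and embedding dimension below are arbitrary, not taken from the excerpt):

import torch
import torch.nn as nn

# A table of 10 "words", each mapped to a 3-dimensional vector.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)

# Integer keys select rows of the weight matrix.
indices = torch.tensor([1, 4, 4, 7])
vectors = embedding(indices)                             # shape: (4, 3)

# The same rows can be read directly from the underlying table.
print(torch.equal(vectors, embedding.weight[indices]))   # True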

What is the difference between an embedding layer and a dense layer?

While a Dense layer treats W as an actual weight matrix, an Embedding layer treats W as a simple lookup table. Moreover, it treats each integer in the input array as the index at which to find the corresponding row of weights in the lookup table.
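
A short sketch of that equivalence (my own example, not part of the excerpt): multiplying a one-hot vector by W, as a Dense layer would, selects the same row that an Embedding lookup returns directly.

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 5, 3
embedding = nn.Embedding(vocab_size, dim)
indices = torch.tensor([0, 3, 3])

# Embedding: direct lookup by index.
via_lookup = embedding(indices)

# Dense-style: one-hot encode the indices, then matrix-multiply by the same W.
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(torch.allclose(via_lookup, via_matmul))            # True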

Why do we use embedding layer?

Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

Why do we use embedding layer in Lstm?

An LSTM network is a type of recurrent neural network (RNN) that can learn long-term dependencies between time steps of sequence data. A word embedding layer maps a sequence of word indices to embedding vectors and learns the word embedding during training.
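
As a rough sketch of that pattern (the sizes below are made up for illustration), the embedding output is what actually feeds the LSTM:

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 100, 16, 32
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# A batch of 2 sequences, each 5 word indices long.
tokens = torch.randint(0, vocab_size, (2, 5))
embedded = embedding(tokens)                             # (2, 5, 16)
output, (h_n, c_n) = lstm(embedded)                      # output: (2, 5, 32)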


1 Answer

Upon closer inspection, sparse gradients on Embeddings are optional and can be turned on or off with the sparse parameter:

class torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2, scale_grad_by_freq=False, sparse=False)

Where:

sparse (boolean, optional) – if True, gradient w.r.t. weight matrix will be a sparse tensor. See Notes for more details regarding sparse gradients.

And the "Notes" mentioned are what I quoted in the question about a limited number of optimizers being supported for sparse gradients.

Update:

It is theoretically possible but technically difficult to implement some optimization methods on sparse gradients. There is an open issue in the PyTorch repo to add support for all optimizers.

Regarding the original question, I believe Embeddings can be treated as sparse because it is possible to operate on the input indices directly rather than converting them to one-hot encodings for input into a dense layer. This is explained in @Maxim's answer to my related question.
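
A quick way to see this sparsity (an illustration of my own): in a backward pass, only the rows of the weight matrix whose indices actually appeared in the input receive a non-zero gradient, even with the default dense gradient.

import torch
import torch.nn as nn

embedding = nn.Embedding(1000, 4)                        # default: sparse=False
embedding(torch.tensor([5, 7])).sum().backward()

grad = embedding.weight.grad
print(grad.abs().sum(dim=1).nonzero().flatten())         # tensor([5, 7]); the other 998 rows stay zero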

answered Oct 02 '22 by Imran