 

What is the difference between an Embedding Layer and a Dense Layer?

The docs for an Embedding Layer in Keras say:

Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

I believe this could also be achieved by encoding the inputs as one-hot vectors of length vocabulary_size, and feeding them into a Dense Layer.

Is an Embedding Layer merely a convenience for this two-step process, or is something fancier going on under the hood?
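
For concreteness, a minimal sketch of that two-step alternative (with illustrative values for vocabulary_size and embedding_dim, not taken from the Keras docs) might look like:

    import numpy as np
    import tensorflow as tf

    vocabulary_size = 20   # illustrative
    embedding_dim = 2      # illustrative

    word_ids = np.array([4, 19, 7])                        # integer-encoded words
    one_hot = tf.one_hot(word_ids, depth=vocabulary_size)  # shape (3, vocabulary_size)
    dense = tf.keras.layers.Dense(embedding_dim, use_bias=False)
    embedded = dense(one_hot)                              # shape (3, embedding_dim)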

Imran asked Dec 18 '17


3 Answers

An embedding layer is faster, because it is essentially the equivalent of a dense layer that makes simplifying assumptions.

Imagine a word-to-embedding layer with these weights:

w = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 0.0, 0.1, 0.2]]

A Dense layer will treat these like actual weights with which to perform matrix multiplication. An embedding layer will simply treat these weights as a list of vectors, each vector representing one word; the 0th word in the vocabulary is w[0], 1st is w[1], etc.


For an example, use the weights above and this sentence:

[0, 2, 1, 2] 

A naive Dense-based net needs to convert that sentence to a 1-hot encoding

[[1, 0, 0],
 [0, 0, 1],
 [0, 1, 0],
 [0, 0, 1]]

then do a matrix multiplication

[[1 * 0.1 + 0 * 0.5 + 0 * 0.9, 1 * 0.2 + 0 * 0.6 + 0 * 0.0, 1 * 0.3 + 0 * 0.7 + 0 * 0.1, 1 * 0.4 + 0 * 0.8 + 0 * 0.2],
 [0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2],
 [0 * 0.1 + 1 * 0.5 + 0 * 0.9, 0 * 0.2 + 1 * 0.6 + 0 * 0.0, 0 * 0.3 + 1 * 0.7 + 0 * 0.1, 0 * 0.4 + 1 * 0.8 + 0 * 0.2],
 [0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2]]

=

[[0.1, 0.2, 0.3, 0.4],
 [0.9, 0.0, 0.1, 0.2],
 [0.5, 0.6, 0.7, 0.8],
 [0.9, 0.0, 0.1, 0.2]]

However, an Embedding layer simply looks at [0, 2, 1, 2] and takes the weights of the layer at indices zero, two, one, and two to immediately get

[w[0],
 w[2],
 w[1],
 w[2]]

=

[[0.1, 0.2, 0.3, 0.4],
 [0.9, 0.0, 0.1, 0.2],
 [0.5, 0.6, 0.7, 0.8],
 [0.9, 0.0, 0.1, 0.2]]

So it's the same result, just obtained in a hopefully faster way.
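
A quick NumPy check of that equivalence (a sketch added for illustration, using the same w and sentence as above):

    import numpy as np

    w = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 0.0, 0.1, 0.2]])
    sentence = np.array([0, 2, 1, 2])

    one_hot = np.eye(3)[sentence]       # shape (4, 3)
    dense_result = one_hot @ w          # matrix multiplication, shape (4, 4)
    lookup_result = w[sentence]         # plain row lookup, shape (4, 4)

    print(np.allclose(dense_result, lookup_result))  # True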


The Embedding layer does have limitations:

  • The input needs to be integers in [0, vocab_length).
  • No bias.
  • No activation.

However, none of those limitations should matter if you just want to convert an integer-encoded word into an embedding.
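
For reference, a minimal Keras sketch of that lookup, assuming the layer is given the w matrix from the example above:

    import numpy as np
    import tensorflow as tf

    w = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.5, 0.6, 0.7, 0.8],
                  [0.9, 0.0, 0.1, 0.2]])

    embedding = tf.keras.layers.Embedding(input_dim=3, output_dim=4)
    embedding.build((None,))      # create the (3, 4) weight matrix
    embedding.set_weights([w])    # reuse w from the example above

    print(embedding(tf.constant([0, 2, 1, 2])).numpy())  # rows 0, 2, 1, 2 of w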

The Guy with The Hat answered Sep 19 '22


Mathematically, the difference is this:

  • An embedding layer performs a select (gather) operation. In Keras, this layer is equivalent to:

    K.gather(self.embeddings, inputs)      # just one matrix
    
  • A dense layer performs a dot-product operation, plus an optional bias and activation:

    outputs = matmul(inputs, self.kernel)  # a kernel matrix
    outputs = bias_add(outputs, self.bias) # a bias vector
    return self.activation(outputs)        # an activation function
    

You can emulate an embedding layer with a fully-connected layer via one-hot encoding, but the whole point of dense embeddings is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k words (sometimes even a million). On top of that, it is often necessary to process sequences of words in batches. Processing a batch of sequences of word indices is much more efficient than processing a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot-product, in both the forward and backward pass.
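
As a rough sketch of that emulation (my own example, with illustrative vocab_size and embed_dim): an Embedding layer and a bias-free, activation-free Dense layer on one-hot inputs produce identical outputs once they share the same weight matrix; the one-hot tensor is just vocab_size times larger.

    import numpy as np
    import tensorflow as tf

    vocab_size, embed_dim = 1000, 16
    weights = np.random.rand(vocab_size, embed_dim).astype("float32")

    word_ids = tf.constant([[3, 42, 7], [0, 999, 5]])   # (batch, seq_len) of indices
    one_hot = tf.one_hot(word_ids, depth=vocab_size)    # (batch, seq_len, vocab_size)

    embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
    dense = tf.keras.layers.Dense(embed_dim, use_bias=False)

    # Build both layers, then give them the same weight matrix
    embedding(word_ids)
    dense(one_hot)
    embedding.set_weights([weights])
    dense.set_weights([weights])

    print(np.allclose(embedding(word_ids), dense(one_hot)))  # True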

Maxim answered Sep 19 '22


Here I want to add more detail to the top-voted answer:

An embedding layer is generally used to reduce sparse one-hot input vectors to denser representations.

  1. An embedding layer is much like a table lookup. When the table is small, it is fast.

  2. When the table is large, the lookup is much slower. In practice, we would use a dense layer as a dimension reducer on the one-hot input instead of an embedding layer in this case.

kiryu nil answered Sep 21 '22