The docs for an Embedding Layer in Keras say:

"Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]"

I believe this could also be achieved by encoding the inputs as one-hot vectors of length vocabulary_size, and feeding them into a Dense Layer.

Is an Embedding Layer merely a convenience for this two-step process, or is something fancier going on under the hood?


What is the difference between an Embedding Layer and a Dense Layer?

3 Answers

An embedding layer is faster, because it is essentially the equivalent of a dense layer that makes simplifying assumptions.

Imagine a word-to-embedding layer with these weights:

w = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 0.0, 0.1, 0.2]]

A Dense layer will treat these like actual weights with which to perform matrix multiplication. An embedding layer will simply treat these weights as a list of vectors, each vector representing one word; the 0th word in the vocabulary is w[0], 1st is w[1], etc.

For an example, use the weights above and this sentence:

[0, 2, 1, 2]

A naive Dense-based net needs to convert that sentence to a 1-hot encoding

[[1, 0, 0],
 [0, 0, 1],
 [0, 1, 0],
 [0, 0, 1]]

then do a matrix multiplication

[[1 * 0.1 + 0 * 0.5 + 0 * 0.9, 1 * 0.2 + 0 * 0.6 + 0 * 0.0, 1 * 0.3 + 0 * 0.7 + 0 * 0.1, 1 * 0.4 + 0 * 0.8 + 0 * 0.2],
 [0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2],
 [0 * 0.1 + 1 * 0.5 + 0 * 0.9, 0 * 0.2 + 1 * 0.6 + 0 * 0.0, 0 * 0.3 + 1 * 0.7 + 0 * 0.1, 0 * 0.4 + 1 * 0.8 + 0 * 0.2],
 [0 * 0.1 + 0 * 0.5 + 1 * 0.9, 0 * 0.2 + 0 * 0.6 + 1 * 0.0, 0 * 0.3 + 0 * 0.7 + 1 * 0.1, 0 * 0.4 + 0 * 0.8 + 1 * 0.2]]

=

[[0.1, 0.2, 0.3, 0.4],
 [0.9, 0.0, 0.1, 0.2],
 [0.5, 0.6, 0.7, 0.8],
 [0.9, 0.0, 0.1, 0.2]]

However, an Embedding layer simply looks at [0, 2, 1, 2] and takes the weights of the layer at indices zero, two, one, and two to immediately get

[w[0],
 w[2],
 w[1],
 w[2]]

=

[[0.1, 0.2, 0.3, 0.4],
 [0.9, 0.0, 0.1, 0.2],
 [0.5, 0.6, 0.7, 0.8],
 [0.9, 0.0, 0.1, 0.2]]

So it's the same result, just obtained in a hopefully faster way.
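The equivalence claimed above can be checked with a short plain-Python sketch. It uses the weights and sentence from the example; the helper names `one_hot` and `matmul` are made up for illustration and are not Keras functions.

```python
# Embedding weights from the example: 3 words, 4 dimensions each.
w = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 0.0, 0.1, 0.2]]

sentence = [0, 2, 1, 2]


def one_hot(index, size):
    """Return a list of length `size` with a 1 at `index` and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(size)]


def matmul(a, b):
    """Multiply matrix `a` (n x k) by matrix `b` (k x m), both lists of lists."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]


# Dense-style path: one-hot encode every word, then multiply by the weights.
encoded = [one_hot(i, len(w)) for i in sentence]
dense_out = matmul(encoded, w)

# Embedding-style path: index straight into the rows of the weight matrix.
embedding_out = [w[i] for i in sentence]

# Both paths select the same rows of `w`.
same = dense_out == embedding_out
print(same)
```

The dense path performs 3 multiplications per output value only to throw most of them away, while the lookup path touches each selected row once.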

The Embedding layer does have limitations:

The input needs to be integers in [0, vocab_length).
No bias.
No activation.

However, none of those limitations should matter if you just want to convert an integer-encoded word into an embedding.
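To make the first limitation concrete, here is a minimal sketch of a lookup that enforces the [0, vocab_length) range. The `embed` helper is hypothetical and only illustrates the constraint; it is not how Keras validates its inputs.

```python
w = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 0.0, 0.1, 0.2]]


def embed(indices, weights):
    """Look up one weight row per index (hypothetical helper, not the Keras API)."""
    vocab_length = len(weights)
    for i in indices:
        # Only integers in [0, vocab_length) name a row of the table.
        if not isinstance(i, int) or not 0 <= i < vocab_length:
            raise ValueError(f"index {i!r} is outside [0, {vocab_length})")
    # No bias is added and no activation is applied: the rows come back as-is.
    return [weights[i] for i in indices]


ok = embed([0, 2, 1, 2], w)

try:
    embed([0, 3], w)  # 3 is not a valid row of a 3-word table
    rejected = False
except ValueError:
    rejected = True
```

The explicit check matters in this sketch because a plain Python list would silently accept a negative index such as -1 and return the last row.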

137

answered Sep 19 '22 07:09

The Guy with The Hat

Mathematically, the difference is this:

An embedding layer performs a select (lookup) operation. In Keras, this layer is equivalent to:
```
K.gather(self.embeddings, inputs)      # just one matrix
```

A dense layer performs a dot-product operation, plus an optional bias and activation:

outputs = matmul(inputs, self.kernel)  # a kernel matrix
outputs = bias_add(outputs, self.bias) # a bias vector
return self.activation(outputs)        # an activation function
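The two operations can be sketched in plain Python as follows. This is only an illustration under assumptions: `gather`, `dense`, and `relu` are stand-ins written for this example, and the small kernel and bias values are invented, not taken from Keras.

```python
def gather(embeddings, indices):
    """Select rows of `embeddings` by index: the whole embedding forward pass."""
    return [embeddings[i] for i in indices]


def dense(inputs, kernel, bias, activation):
    """Fully-connected forward pass: matmul, then bias add, then activation."""
    outputs = []
    for row in inputs:
        # matmul: one dot product per output unit
        z = [sum(x * kernel[k][j] for k, x in enumerate(row))
             for j in range(len(bias))]
        # bias_add: one bias value per output unit
        z = [v + b for v, b in zip(z, bias)]
        outputs.append([activation(v) for v in z])
    return outputs


def relu(v):
    return max(0.0, v)


# Invented values for illustration: 2 inputs, 2 output units.
kernel = [[1.0, -2.0],
          [3.0, 4.0]]
bias = [0.5, 0.5]
one_hot_inputs = [[1, 0],
                  [0, 1]]

looked_up = gather(kernel, [0, 1])                      # rows returned unchanged
dense_out = dense(one_hot_inputs, kernel, bias, relu)   # rows shifted and clipped
```

With a zero bias and an identity activation, `dense` on one-hot inputs would return the same rows as `gather`; the bias and activation are what make the general dense layer a different function.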

You can emulate an embedding layer with a fully-connected layer via one-hot encoding, but the whole point of a dense embedding is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k words (sometimes even a million). On top of that, sequences of words often need to be processed in batches. Processing a batch of sequences of word indices is much more efficient than processing a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot-product, in both the forward and the backward pass.
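A rough back-of-the-envelope count shows the scale of the difference. Only the 100k vocabulary size comes from the text above; the embedding dimension, batch size, and sequence length are assumed values picked for illustration, and the dense count assumes a naive matmul that does not exploit sparsity.

```python
vocab_size = 100_000    # order of magnitude mentioned above
embedding_dim = 128     # assumed for illustration
batch_size = 32         # assumed for illustration
sequence_length = 50    # assumed for illustration

tokens = batch_size * sequence_length

# Size of the layer input in each representation.
index_input_values = tokens                  # one integer per token
one_hot_input_values = tokens * vocab_size   # one vocab-sized vector per token

# Work done in the forward pass.
lookup_values_copied = tokens * embedding_dim              # gather: copy one row per token
matmul_multiply_adds = tokens * vocab_size * embedding_dim  # naive dense matmul

print(index_input_values, one_hot_input_values)
print(lookup_values_copied, matmul_multiply_adds)
```

Under these assumptions the one-hot input is `vocab_size` times larger than the index input, and the naive matmul does `vocab_size` times more arithmetic than the lookup copies values.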

answered Sep 19 '22 07:09

Maxim

Here I want to improve the voted answer by providing more details:

When we use an embedding layer, it is generally to reduce sparse one-hot input vectors to denser representations.

1. An embedding layer is much like a table lookup. When the table is small, it is fast.
2. When the table is large, the table lookup is much slower. In practice, we would use a dense layer as a dimension reducer for the one-hot input instead of an embedding layer in this case.

answered Sep 21 '22 07:09

kiryu nil


Tags:

machine-learning

neural-network

deep-learning

keras

keras-layer

Imran
