Why not use Flatten followed by a Dense layer instead of TimeDistributed?

Tags: machine-learning, tensorflow, keras, lstm, keras-layer

I am trying to understand the Keras layers better. I am working on a sequence-to-sequence model where I embed a sentence and pass it to an LSTM that returns sequences. After that, I want to apply a Dense layer to each timestep (word) in the sentence, and it seems that TimeDistributed does the job for three-dimensional tensors like this one.

In my understanding, Dense layers only work for two-dimensional tensors, and TimeDistributed just applies the same Dense layer to every timestep of a three-dimensional tensor. Could one not simply flatten the timesteps, apply a Dense layer, and reshape to obtain the same result, or are these not equivalent in some way that I am missing?

asked Dec 07 '18 13:12

Andreas Olesen

1 Answer

Imagine you have a batch of 4 time steps, each containing a 3-element vector. Let's represent that with this:

[Image: Input batch, https://i.stack.imgur.com/wbOQt.png]

Now you want to transform this batch using a dense layer, so you get 5 features per time step. The output of the layer can be represented as something like this:

[Image: Output batch, https://i.stack.imgur.com/4qCR8.png]

You are considering two options: a TimeDistributed dense layer, or reshaping the input into a flat vector, applying a dense layer, and reshaping the result back into time steps.

In the first option, you would apply a dense layer with 3 inputs and 5 outputs to every single time step. This could look like this:

[Image: TimeDistributed layer, https://i.stack.imgur.com/PHfdF.png]

Each blue circle here is a unit in the dense layer. By doing this with every input time step you get the total output. Importantly, these five units are the same for all the time steps, so you only have the parameters of a single dense layer with 3 inputs and 5 outputs.
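The parameter sharing described here can be checked with a few lines of NumPy. This is only a sketch of the arithmetic, not Keras code: `W` and `b` are hypothetical stand-ins for the kernel and bias of the single dense layer, and the random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# One sequence: 4 time steps, each a 3-element vector (as in the example).
x = rng.normal(size=(4, 3))

# A single dense layer with 3 inputs and 5 outputs: one weight matrix, one bias.
# (Hypothetical stand-ins for the layer's kernel and bias.)
W = rng.normal(size=(3, 5))
b = rng.normal(size=(5,))

# Apply the same layer to every time step, one step at a time.
per_step = np.stack([x[t] @ W + b for t in range(x.shape[0])])

# The same computation written as one matrix product over all time steps.
all_at_once = x @ W + b

# The parameters do not depend on the number of time steps: 3 * 5 weights + 5 biases.
n_params_shared = W.size + b.size

# Because nothing above depends on the sequence length, 7 time steps work too.
x_long = rng.normal(size=(7, 3))
long_output_shape = (x_long @ W + b).shape
```

Both ways of computing the output agree, and the same `W` and `b` handle a sequence of any length.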

The second option would involve flattening the input into a 12-element vector, applying a dense layer with 12 inputs and 20 outputs, and then reshaping that back. This is how it would look:

[Image: Flat dense layer, https://i.stack.imgur.com/r590r.png]

Only the input connections of one unit are drawn for clarity, but every unit is connected to every input. This layer has many more parameters (those of a dense layer with 12 inputs and 20 outputs: 260 counting biases, against 20 for the shared layer). Also note that each output value is influenced by every input value, so values in one time step affect outputs in other time steps. Whether that is good or bad depends on your problem and model, but it is an important difference from the first option, where the output of each time step depends only on the input of that time step. In addition, this configuration requires a fixed number of time steps in every batch, whereas the first option works regardless of the number of time steps.
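A small NumPy sketch can make these three differences concrete: the parameter count, the influence across time steps, and the fixed sequence length. The weights and the helper functions `shared_dense` and `flat_dense` are hypothetical stand-ins for the two options, not Keras layers, and biases are assumed to be present.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))  # 4 time steps, 3 features each

# First option: one dense layer (3 inputs, 5 outputs) shared by all time steps.
W_shared = rng.normal(size=(3, 5))
b_shared = rng.normal(size=(5,))

# Second option: one dense layer on the flattened input (12 inputs, 20 outputs).
W_flat = rng.normal(size=(12, 20))
b_flat = rng.normal(size=(20,))

def shared_dense(seq):
    # Same weights for every time step; works for any number of steps.
    return seq @ W_shared + b_shared

def flat_dense(seq):
    # Flatten, apply the big layer, reshape back to (4 time steps, 5 features).
    flat = seq.reshape(-1)
    return (flat @ W_flat + b_flat).reshape(4, 5)

n_params_shared = W_shared.size + b_shared.size  # 3 * 5 + 5
n_params_flat = W_flat.size + b_flat.size        # 12 * 20 + 20

# Change only the first time step and look at the output of the last one.
x_perturbed = x.copy()
x_perturbed[0] += 1.0
shared_last_step_changed = not np.allclose(
    shared_dense(x)[3], shared_dense(x_perturbed)[3])
flat_last_step_changed = not np.allclose(
    flat_dense(x)[3], flat_dense(x_perturbed)[3])

# A sequence with 7 time steps: fine for the shared layer, not for the flat one.
x_long = rng.normal(size=(7, 3))
shared_long_shape = shared_dense(x_long).shape
try:
    flat_dense(x_long)
    flat_accepts_long = True
except ValueError:
    # 21 flattened inputs do not match the 12 inputs the flat layer expects.
    flat_accepts_long = False
```

With the shared layer the last time step is unaffected by a change in the first one, while with the flat layer it changes.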

You could also consider a third option: four dense layers, each applied independently to its own time step (I didn't draw it, but hopefully you get the idea). That would be similar to the flat layer, except that each unit would receive input connections only from the inputs of its own time step. I don't think there is a straightforward way to do that in Keras; you would have to split the input into four parts, apply a dense layer to each part, and merge the outputs. Again, in this case the number of time steps would be fixed.
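The split, apply, and merge idea can be sketched in NumPy as follows. This only illustrates the computation under the assumption of one independent weight matrix and bias per time step; the names `Ws` and `bs` are hypothetical, and it is not a Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))  # 4 time steps, 3 features each

# One independent dense layer (3 inputs, 5 outputs) per time step.
Ws = [rng.normal(size=(3, 5)) for _ in range(4)]
bs = [rng.normal(size=(5,)) for _ in range(4)]

# Split the input by time step, apply each step's own layer, merge the outputs.
outputs = [x[t] @ Ws[t] + bs[t] for t in range(4)]
y = np.stack(outputs)

# Four times the parameters of the shared layer: 4 * (3 * 5 + 5).
n_params_separate = sum(W.size + b.size for W, b in zip(Ws, bs))

# Time steps stay independent: changing step 0 leaves steps 1 to 3 untouched.
x_perturbed = x.copy()
x_perturbed[0] += 1.0
y_perturbed = np.stack([x_perturbed[t] @ Ws[t] + bs[t] for t in range(4)])
other_steps_unchanged = np.allclose(y[1:], y_perturbed[1:])
```

The number of layers is tied to the number of time steps, which is why the sequence length has to be fixed here as well.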

answered Sep 28 '22 02:09

jdehesa
