TimeDistributed(Dense) vs Dense in Keras - Same number of parameters

I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.

My simplified model is the following:

    import keras

    InputSize = 15
    MaxLen = 64
    HiddenSize = 16

    inputs = keras.layers.Input(shape=(MaxLen, InputSize))
    x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
    x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
    predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    input_1 (InputLayer)         (None, 64, 15)            0
    _________________________________________________________________
    gru_1 (GRU)                  (None, 64, 16)            1536
    _________________________________________________________________
    time_distributed_1 (TimeDist (None, 64, 15)            255
    _________________________________________________________________
    activation_1 (Activation)    (None, 64, 15)            0
    =================================================================

This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
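As a sanity check, both counts in the summary can be reproduced by hand. This is only a minimal sketch: `dense_params` and `gru_params` are hypothetical helpers (not Keras functions), and the GRU formula assumes the classic formulation with one bias vector per gate, which is what matches the 1536 reported above.

```python
def dense_params(input_dim, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return input_dim * units + units


def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights and a bias.

    Assumption: a single bias vector per gate.
    """
    return 3 * (input_dim * units + units * units + units)


InputSize = 15
MaxLen = 64
HiddenSize = 16

print(gru_params(InputSize, HiddenSize))    # 1536
print(dense_params(HiddenSize, InputSize))  # 255
```

Note that `MaxLen` appears in neither formula: the number of time steps does not affect the parameter count when the same weights are reused at every step.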

However, if I switch to a simple Dense layer:

    inputs = keras.layers.Input(shape=(MaxLen, InputSize))
    x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
    x = keras.layers.Dense(InputSize)(x)
    predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    input_1 (InputLayer)         (None, 64, 15)            0
    _________________________________________________________________
    gru_1 (GRU)                  (None, 64, 16)            1536
    _________________________________________________________________
    dense_1 (Dense)              (None, 64, 15)            255
    _________________________________________________________________
    activation_1 (Activation)    (None, 64, 15)            0
    =================================================================

I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).

Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py, it does seem that Dense uses only the last dimension to size itself:

    def build(self, input_shape):
        assert len(input_shape) >= 2
        input_dim = input_shape[-1]

        self.kernel = self.add_weight(shape=(input_dim, self.units),

It also uses K.dot (the Keras backend dot function) to apply the weights:

    def call(self, inputs):
        output = K.dot(inputs, self.kernel)

The docs of K.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be called at every time step. If so, the question still remains what TimeDistributed() achieves in this case.
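The "called at every time step" idea can be checked outside Keras with a small NumPy sketch. The assumptions here are mine: `np.dot` stands in for the backend dot on a 3-D input and a 2-D kernel, random arrays stand in for the GRU output and the Dense weights, and the flatten-and-restore variant is only a rough picture of what a wrapper such as TimeDistributed might do, not its actual implementation.

```python
import numpy as np

batch, MaxLen, HiddenSize, InputSize = 4, 64, 16, 15

rng = np.random.default_rng(0)
h = rng.random((batch, MaxLen, HiddenSize))   # stand-in for the GRU output
kernel = rng.random((HiddenSize, InputSize))  # stand-in for the Dense weights
bias = rng.random(InputSize)                  # stand-in for the Dense biases

# 1) Dense-style: a single dot over the last axis of the 3-D tensor.
direct = np.dot(h, kernel) + bias

# 2) Explicit loop: apply the same weights separately at each time step.
per_step = np.stack(
    [np.dot(h[:, t, :], kernel) + bias for t in range(MaxLen)], axis=1
)

# 3) Flatten batch and time into one axis, apply once, restore the shape.
flat = (np.dot(h.reshape(-1, HiddenSize), kernel) + bias).reshape(
    batch, MaxLen, InputSize
)

print(direct.shape)
print(np.allclose(direct, per_step), np.allclose(direct, flat))
```

If all three agree, then for this input shape a plain dot over the last axis is numerically the same as applying one shared Dense at each time step.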

339

asked Jun 18 '17 01:06

thon

1 Answer

TimeDistributed(Dense) applies the same Dense layer to every time step while the GRU/LSTM cell is unrolled. The loss is therefore computed between the predicted label sequence and the actual label sequence, which is normally what sequence-to-sequence labeling problems require.

However, with return_sequences=False, the Dense layer is applied only once, to the output of the last cell. This is normally the case when RNNs are used for classification problems. If return_sequences=True, the Dense layer is applied at every time step, just like TimeDistributed(Dense).

So, as far as your models are concerned, both are the same. But if you change your second model to return_sequences=False, Dense will be applied only at the last cell. Try changing it and the model will throw an error, because Y would then need to be of size [Batch_size, InputSize]; it is no longer a sequence-to-sequence problem but a full-sequence-to-label problem.

    from keras.models import Sequential
    from keras.layers import Dense, Activation, TimeDistributed, GRU
    import numpy as np

    InputSize = 15
    MaxLen = 64
    HiddenSize = 16

    OutputSize = 8
    n_samples = 1000

    # model1: sequence to sequence, Dense wrapped in TimeDistributed
    model1 = Sequential()
    model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
    model1.add(TimeDistributed(Dense(OutputSize)))
    model1.add(Activation('softmax'))
    model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # model2: sequence to sequence, plain Dense on the full sequence
    model2 = Sequential()
    model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
    model2.add(Dense(OutputSize))
    model2.add(Activation('softmax'))
    model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # model3: full sequence to label, Dense sees only the last GRU output
    model3 = Sequential()
    model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
    model3.add(Dense(OutputSize))
    model3.add(Activation('softmax'))
    model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # Random data: Y1 has one target per time step, Y2 one target per sample
    X = np.random.random([n_samples, MaxLen, InputSize])
    Y1 = np.random.random([n_samples, MaxLen, OutputSize])
    Y2 = np.random.random([n_samples, OutputSize])

    # epochs replaces the deprecated nb_epoch argument
    model1.fit(X, Y1, batch_size=128, epochs=1)
    model2.fit(X, Y1, batch_size=128, epochs=1)
    model3.fit(X, Y2, batch_size=128, epochs=1)

    # summary() prints directly, so no print() wrapper is needed
    model1.summary()
    model2.summary()
    model3.summary()

In the above example, the architectures of model1 and model2 are the same (sequence-to-sequence models), and model3 is a full-sequence-to-label model.
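The difference in target shapes can also be seen without training anything. The following is a rough NumPy sketch, not Keras code: random arrays stand in for the GRU hidden states and the Dense weights, and taking the last time step with `h_seq[:, -1, :]` is my assumption of what return_sequences=False amounts to for unmasked input.

```python
import numpy as np

n_samples, MaxLen, HiddenSize, OutputSize = 1000, 64, 16, 8

rng = np.random.default_rng(0)
kernel = rng.random((HiddenSize, OutputSize))  # stand-in for the Dense weights
bias = rng.random(OutputSize)                  # stand-in for the Dense biases

# return_sequences=True: one hidden state per time step.
h_seq = rng.random((n_samples, MaxLen, HiddenSize))
# return_sequences=False: only the hidden state of the last time step.
h_last = h_seq[:, -1, :]

y_seq = np.dot(h_seq, kernel) + bias    # shape matches Y1
y_last = np.dot(h_last, kernel) + bias  # shape matches Y2

print(y_seq.shape)
print(y_last.shape)
```

Under these assumptions `y_last` is just the final time-step slice of `y_seq`, which is why model3 needs one target per sample rather than one per time step.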

183

answered Oct 08 '22 18:10

mujjiga

Tags: machine-learning, neural-network, keras, keras-layer, recurrent-neural-network
