I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.
My simplified model is the following:
    InputSize = 15
    MaxLen = 64
    HiddenSize = 16

    inputs = keras.layers.Input(shape=(MaxLen, InputSize))
    x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
    x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
    predictions = keras.layers.Activation('softmax')(x)
The summary of the network is:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    input_1 (InputLayer)         (None, 64, 15)            0
    _________________________________________________________________
    gru_1 (GRU)                  (None, 64, 16)            1536
    _________________________________________________________________
    time_distributed_1 (TimeDist (None, 64, 15)            255
    _________________________________________________________________
    activation_1 (Activation)    (None, 64, 15)            0
    =================================================================
This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
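The counts in the summary can be reproduced by hand. The sketch below is plain arithmetic rather than Keras code. The two helper names are made up for illustration, and the GRU formula assumes the classic variant with a single bias vector per gate, which is what the 1536 above is consistent with (a Keras version that defaults to reset_after=True would likely report a larger number).

```python
def dense_param_count(input_dim, units):
    """Weights plus biases of a Dense layer acting on the last axis."""
    return input_dim * units + units


def gru_param_count(input_dim, units):
    """Three gates, each with input weights, recurrent weights and one bias.

    Assumption: the classic GRU variant with a single bias vector per gate.
    """
    return 3 * (input_dim * units + units * units + units)


InputSize = 15
MaxLen = 64
HiddenSize = 16

dense_params = dense_param_count(HiddenSize, InputSize)
gru_params = gru_param_count(InputSize, HiddenSize)

# MaxLen never enters either formula: the same weights are reused at every step.
print(dense_params)  # 16*15 + 15 = 255
print(gru_params)    # 3 * (15*16 + 16*16 + 16) = 1536
```

Note that MaxLen does not appear in either formula, so a count of 255 on its own cannot distinguish Dense from TimeDistributed(Dense).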
However, if I switch to a simple Dense layer:
    inputs = keras.layers.Input(shape=(MaxLen, InputSize))
    x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
    x = keras.layers.Dense(InputSize)(x)
    predictions = keras.layers.Activation('softmax')(x)
I still only have 255 parameters:
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    input_1 (InputLayer)         (None, 64, 15)            0
    _________________________________________________________________
    gru_1 (GRU)                  (None, 64, 16)            1536
    _________________________________________________________________
    dense_1 (Dense)              (None, 64, 15)            255
    _________________________________________________________________
    activation_1 (Activation)    (None, 64, 15)            0
    =================================================================
I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py, it does seem that Dense uses only the last dimension to size itself:
    def build(self, input_shape):
        assert len(input_shape) >= 2
        input_dim = input_shape[-1]

        self.kernel = self.add_weight(shape=(input_dim, self.units),
It also uses K.dot (keras.backend.dot) to apply the weights:
    def call(self, inputs):
        output = K.dot(inputs, self.kernel)
The docs of K.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be applied at every time step. If so, the question still remains what TimeDistributed() achieves in this case.
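A minimal NumPy sketch of that idea, assuming the backend dot behaves like np.dot for a 3D input and a 2D kernel (which is what its documentation suggests). The array names and the batch size of 4 are made up for illustration, and nothing here calls Keras itself, so it only mirrors the arithmetic, not the actual layer implementations.

```python
import numpy as np

batch_size, MaxLen, HiddenSize, InputSize = 4, 64, 16, 15

rng = np.random.default_rng(0)
x = rng.normal(size=(batch_size, MaxLen, HiddenSize))  # stand-in for the GRU output
kernel = rng.normal(size=(HiddenSize, InputSize))      # one shared weight matrix
bias = rng.normal(size=(InputSize,))                   # one shared bias vector

# "Dense" style: a single dot product that contracts the last axis of x.
dense_out = np.dot(x, kernel) + bias

# "TimeDistributed" style: apply the same weights to each time slice in turn.
per_step = [np.dot(x[:, t, :], kernel) + bias for t in range(MaxLen)]
time_distributed_out = np.stack(per_step, axis=1)

# Compare the two results element-wise (up to floating-point tolerance).
same = np.allclose(dense_out, time_distributed_out)
print(dense_out.shape, time_distributed_out.shape, same)
```

If the printed flag is True, the single dot product and the per-step loop agree, which would explain why both models report the same 255 parameters.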
TimeDistributed class: tf.keras.layers.TimeDistributed(layer, **kwargs). This wrapper applies a layer to every temporal slice of an input. Every input should be at least 3D, and the dimension at index one of the first input is treated as the temporal dimension.
The Dense layer is the regular, fully connected neural network layer and one of the most frequently used. It computes output = activation(dot(input, kernel) + bias) and returns the result.
Just as LSTM layers are mostly used in time series analysis or NLP problems and convolutional layers in image processing, a dense layer, also called a fully connected layer, is typically used in the final stages of a neural network.
A dense layer is a simple layer of neurons in which each neuron receives input from all the neurons of the previous layer, hence the name. In image classifiers, dense layers classify the image based on the output of the convolutional layers. Each layer contains many such neurons.
TimeDistributedDense, i.e. TimeDistributed(Dense), applies the same dense layer to every time step during GRU/LSTM cell unrolling, so the error function is computed between the predicted label sequence and the actual label sequence (which is normally what sequence-to-sequence labeling problems require).
However, with return_sequences=False, the Dense layer is applied only once, at the last cell. This is normally the case when RNNs are used for classification problems. If return_sequences=True, the Dense layer is applied to every time step, just like TimeDistributedDense.
So for your models the two are the same, but if you change your second model to return_sequences=False, then Dense will be applied only at the last cell. Try changing it and the model will throw an error, because Y would then have to be of size [Batch_size, InputSize]: it is no longer a sequence-to-sequence problem but a full-sequence-to-label problem.
    from keras.models import Sequential
    from keras.layers import Dense, Activation, TimeDistributed, GRU
    import numpy as np

    InputSize = 15
    MaxLen = 64
    HiddenSize = 16

    OutputSize = 8
    n_samples = 1000

    # model1: TimeDistributed(Dense) on the full GRU output sequence
    model1 = Sequential()
    model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
    model1.add(TimeDistributed(Dense(OutputSize)))
    model1.add(Activation('softmax'))
    model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # model2: plain Dense on the full GRU output sequence
    model2 = Sequential()
    model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
    model2.add(Dense(OutputSize))
    model2.add(Activation('softmax'))
    model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # model3: plain Dense on the last GRU output only
    model3 = Sequential()
    model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
    model3.add(Dense(OutputSize))
    model3.add(Activation('softmax'))
    model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

    # Random data: Y1 is one target per time step, Y2 is one target per sequence
    X = np.random.random([n_samples, MaxLen, InputSize])
    Y1 = np.random.random([n_samples, MaxLen, OutputSize])
    Y2 = np.random.random([n_samples, OutputSize])

    model1.fit(X, Y1, batch_size=128, epochs=1)
    model2.fit(X, Y1, batch_size=128, epochs=1)
    model3.fit(X, Y2, batch_size=128, epochs=1)

    # summary() prints the table itself
    model1.summary()
    model2.summary()
    model3.summary()
In the above example, the architectures of model1 and model2 are the same (sequence-to-sequence models), and model3 is a full-sequence-to-label model.
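The shape difference between the sequence-to-sequence models and the sequence-to-label model can be sketched without Keras. This is only an illustration under stated assumptions: a random array stands in for the GRU hidden states, the dense helper is hypothetical, the batch size of 4 is arbitrary, and return_sequences=False is assumed to hand back the hidden state of the final time step (no masking, no go_backwards).

```python
import numpy as np

batch_size, MaxLen, HiddenSize, OutputSize = 4, 64, 16, 8

rng = np.random.default_rng(0)
# Stand-in for the hidden states a GRU would produce at every time step.
hidden_states = rng.normal(size=(batch_size, MaxLen, HiddenSize))
kernel = rng.normal(size=(HiddenSize, OutputSize))
bias = rng.normal(size=(OutputSize,))


def dense(inputs):
    """Hypothetical Dense layer without activation, acting on the last axis."""
    return np.dot(inputs, kernel) + bias


# return_sequences=True: Dense sees every step (as in model1 and model2).
seq_out = dense(hidden_states)

# return_sequences=False: Dense sees only the final step (as in model3).
last_out = dense(hidden_states[:, -1, :])

print(seq_out.shape)   # one prediction per time step, the shape Y1 must match
print(last_out.shape)  # one prediction per sequence, the shape Y2 must match

# The per-sequence output is just the last slice of the per-step output.
last_matches = np.allclose(seq_out[:, -1, :], last_out)
print(last_matches)
```

Under these assumptions, switching return_sequences changes what the Dense layer receives and therefore the target shape the loss expects, while the Dense weights themselves keep the same shape.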