Do RNNs learn different dependency patterns when the input is batch-major as opposed to time-major?
(Edit: sorry, my initial argument was about why it would make sense, but I realized that it doesn't, so this is a little off-topic.)
I haven't found the TF group's reasoning behind this, but it does not make computational sense, as the ops are written in C++.
Intuitively, we want to mash up (multiply/add etc.) different features from the same sequence at the same timestep. Different timesteps can't be computed in parallel, while different batches/sequences can, so the preferred ordering is feature > batch/sequence > timestep.
By default, NumPy and C++ use a row-major (C-like) memory layout, so
[[ 0. 1. 2.]
[ 3. 4. 5.]
[ 6. 7. 8.]]
is laid out as [0,1,2,3,4,5,6,7,8]
in memory. This means that if we have
x = np.zeros([time,batch,feature])
(time_major=True in TensorFlow), then in row-major memory we get a layout like x[0,0,0], x[0,0,1], x[0,0,2], …, x[0,1,0], and so on. So e.g. the dot product of the weights with the vector from one sequence at one timestep, w*x[t,b,:], is the most contiguous operation, followed by the next sequence, w*x[t,b+1,:], etc. This is what we want during training.
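Here's a quick way to check this; a small sketch where the shapes are just made up for illustration:
import numpy as np

time, batch, feature = 2, 2, 3
x = np.arange(time * batch * feature, dtype=np.float64).reshape(time, batch, feature)

print(x.ravel(order='K'))  # [ 0.  1.  2. ... 11.] -- the actual memory order
print(x.strides)           # (48, 24, 8): the feature axis has the smallest stride
print(x[0, 0, :])          # [0. 1. 2.] -- one timestep of one sequence, contiguous
print(x[0, 1, :])          # [3. 4. 5.] -- same timestep, next sequence, follows directly in memory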
With time_major=False, which is the default, we have [batch,time,feature], so features from the same sequence but different timesteps are more contiguous, i.e. w*x[batch,t,:] is followed by w*x[batch,t+1,:], etc. This might be faster for prediction of one sequence at a time if the RNN is rolled out, but this is speculation.
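For comparison, here are the strides under both layouts (again with made-up shapes); in both cases the feature axis is innermost, the difference is only which slice comes next in memory:
import numpy as np

time, batch, feature = 4, 8, 16

x_tm = np.zeros([time, batch, feature])   # time_major=True:  [time, batch, feature]
x_bm = np.zeros([batch, time, feature])   # time_major=False: [batch, time, feature]

print(x_tm.strides)  # (1024, 128, 8): after x_tm[t, b, :] the next contiguous slice is x_tm[t, b+1, :] (next sequence)
print(x_bm.strides)  # (512, 128, 8):  after x_bm[b, t, :] the next contiguous slice is x_bm[b, t+1, :] (next timestep)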
If you came to this question for the same reason I did: be careful with NumPy's slightly unintuitive indexing, which is meant to be Pythonic, not necessarily row-major. Look at this. As expected:
import numpy as np
x = np.zeros([3,3])
x[0:9].flat = np.arange(10)  # .flat assigns in row-major (C) order; the 10th value doesn't fit and is ignored
print(x)
> [[ 0. 1. 2.]
> [ 3. 4. 5.]
> [ 6. 7. 8.]]
We would also expect x[1] == x[0,1] (i.e. that a single index is a flat, row-major index), but
print(x[1])
> [ 3. 4. 5.]
print(x[np.arange(10) <= 4])
> IndexError: index 3 is out of bounds for axis 0 with size 3
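If you do want flat, row-major indexing, here is a minimal sketch of what does behave that way:
import numpy as np

x = np.zeros([3, 3])
x.flat = np.arange(9)

print(x[0, 1])       # 1.0
print(x.flat[1])     # 1.0 -- flat (row-major) index into the same array
print(x.ravel()[1])  # 1.0 -- equivalent, via a flattened view
print(x[1])          # [3. 4. 5.] -- plain x[1] is the second row, not element 1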
There is no difference in what the model learns.
At timestep t, RNNs need the results from t-1, therefore we need to compute things time-major anyway. If time_major=False, TensorFlow transposes the batch of sequences from (batch_size, max_sequence_length) to (max_sequence_length, batch_size)*. It processes the transposed batch one row at a time: at t=0, the first element of each sequence is processed and the hidden states and outputs are calculated; at the last timestep, the last element of each sequence is processed.
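As a rough sketch of that row-by-row processing (a toy recurrence for illustration, not TensorFlow's actual RNN implementation; all names and shapes here are made up):
import tensorflow as tf

batch_size, max_sequence_length, num_features, num_units = 32, 20, 128, 64

# time-major batch: (max_sequence_length, batch_size, num_features)
inputs = tf.random.normal([max_sequence_length, batch_size, num_features])
W = tf.random.normal([num_features, num_units])
U = tf.random.normal([num_units, num_units])

h = tf.zeros([batch_size, num_units])
for t in range(max_sequence_length):
    # one "row" of the time-major batch: the t-th element of every sequence,
    # combined with the hidden state from timestep t-1
    h = tf.tanh(inputs[t] @ W + h @ U)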
So if your data is already time-major, use time_major=True, which avoids a transpose. But there isn't much point in manually transposing your data before feeding it to TensorFlow.
*If you have multidimensional inputs (e.g. sequences of word embeddings: (batch_size, max_sequence_length, embedding_size)), axes 0 and 1 are transposed, leading to (max_sequence_length, batch_size, embedding_size).
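A minimal sketch of that axis swap (shapes made up for illustration):
import tensorflow as tf

batch_size, max_sequence_length, embedding_size = 32, 20, 128
inputs = tf.random.normal([batch_size, max_sequence_length, embedding_size])

# what TensorFlow effectively does internally when time_major=False:
# swap axes 0 and 1 so the time axis comes first
inputs_time_major = tf.transpose(inputs, perm=[1, 0, 2])
print(inputs_time_major.shape)  # (20, 32, 128)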