Can anyone explain what time-major and batch-major data mean, and what the difference between them is?
To put it in the simplest terms: these are different representations (or arrangements) of the same data.
For example, imagine you have data like this (just for the sake of illustration, not real data):
1 11 21 31
2 12 22 32
3 13 23 33
...
100 111 121 131
... where each row corresponds to a training input and each column corresponds to a different feature. The matrix has size (batch_size, features), where batch_size=100 and features=4.
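As a sketch, here is one way a batch-major matrix like the one above could be built in NumPy. The arithmetic pattern is my own guess from the first rows; as the text says, the numbers are only for illustration.

```python
import numpy as np

# Build a (batch_size, features) matrix where row i holds [i, i+10, i+20, i+30],
# roughly matching the illustrative numbers above (a made-up pattern, not real data).
batch_size, features = 100, 4
data = np.stack(
    [np.arange(1, batch_size + 1) + 10 * k for k in range(features)],
    axis=1,
)

print(data.shape)  # (100, 4): batch-major — axis 0 indexes training examples
print(data[0])     # [ 1 11 21 31] — the first training example
```

Axis 0 walks over training examples and axis 1 over features, which is exactly what "batch-major" means for a rank-2 input.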
Next, in some cases, you may get a transposed matrix as input (for instance, as the output of a previous step):
1 2 3 ... 100
11 12 13 ... 111
21 22 23 ... 121
31 32 33 ... 131
In this case, the matrix shape is (features, batch_size). Note: the data itself doesn't change; only the array dimensions have changed: batch is the 0-axis in the first example and the 1-axis in the second. Also note that one can swap between these representations easily and efficiently. In TensorFlow, this can be done with tf.transpose.
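A minimal sketch of that axis swap, using NumPy's transpose (tf.transpose plays the same role on TensorFlow tensors):

```python
import numpy as np

# Start from the (features, batch_size) arrangement and swap the axes back.
x = np.arange(12).reshape(4, 3)      # pretend: 4 features, batch of 3
x_batch_major = np.transpose(x)      # shape (3, 4): batch is now axis 0

print(x.shape, x_batch_major.shape)  # (4, 3) (3, 4)
# The values are untouched; only the axis order changes:
print(np.array_equal(x_batch_major[0], x[:, 0]))  # True
```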
When it comes to RNNs, the tensors usually go to rank 3 or higher, but the idea stays the same. If the input is (batch_size, sequence_num, features), it's called batch major, because the 0-axis is batch_size. If the input is (sequence_num, batch_size, features), it's likewise called time major. The features axis is always the last dimension (at least I don't know of real cases where it's not), so there's no further variety in naming.
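The rank-3 case can be sketched the same way; converting between batch-major and time-major only permutes the first two axes, while features stays last. The shapes here are arbitrary example values:

```python
import numpy as np

# Batch-major RNN input: (batch_size, sequence_num, features).
batch_size, sequence_num, features = 2, 5, 3
batch_major = np.zeros((batch_size, sequence_num, features))

# Swap axes 0 and 1 to get time-major: (sequence_num, batch_size, features).
# In TensorFlow this would be tf.transpose(batch_major, perm=[1, 0, 2]).
time_major = np.transpose(batch_major, (1, 0, 2))

print(time_major.shape)  # (5, 2, 3)
```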
Depending on the network structure, it might expect specifically the batch or the time dimension as the 0-axis, so the format of the input data matters. And depending on the previous layers, you can get either of those representations to feed into an RNN, so a conversion from one arrangement to the other might be required, either by a library function or by the caller. As far as I can remember, batch major is the default in TensorFlow and Keras, so it simply boils down to what shape is produced by the layer just before the RNN.
Once again: there is one-to-one correspondence between batch major and time major representations. Any tensor can be represented as both. But for a particular implementation, one of those can be expected or required.
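The one-to-one correspondence can be checked directly: transposing to time-major and back recovers the original tensor exactly.

```python
import numpy as np

# Round trip: batch-major -> time-major -> batch-major restores the original,
# since the transpose only permutes axes and never touches the values.
rng = np.random.default_rng(0)
batch_major = rng.standard_normal((2, 5, 3))  # (batch_size, sequence_num, features)
time_major = np.transpose(batch_major, (1, 0, 2))
restored = np.transpose(time_major, (1, 0, 2))

print(np.array_equal(restored, batch_major))  # True
```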