
What's the difference between data time major and batch major?

Can anyone explain what data time major and batch major mean and what's the difference between them?

Chemss-Eddine BenHassine asked Feb 14 '18 09:02





1 Answer

Trying to put it in simplest terms: these are different representations (or arrangements) of the same data.

2D example

For example, imagine you have data like this (just for the sake of illustration, not real data):

1 11 21 31
2 12 22 32
3 13 23 33
...
100 111 121 131

... where each row corresponds to a training input and each column corresponds to a different feature. The matrix has size (batch_size, features), where batch_size=100 and features=4.

Next, in some cases, you may get a transposed matrix as input (for instance, it's an output from the previous step):

1 2 3 ... 100
11 12 13 ... 111
21 22 23 ... 121
31 32 33 ... 131

In this case, the matrix shape is (features, batch_size). Note: the data itself doesn't change. Only the arrangement of the array dimensions has changed: batch is the 0-axis in the first example and the 1-axis in the second. Also note that one can swap between the two representations very easily and efficiently. In TensorFlow, it can be done with tf.transpose.
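To make the 2D case concrete, here is a minimal sketch. It uses NumPy instead of TensorFlow so it runs without extra setup, and the array values are arbitrary placeholders rather than the numbers from the tables above.

```python
import numpy as np

# Illustrative data only: 100 samples with 4 features each (values are arbitrary).
batch_size, features = 100, 4
data = np.arange(batch_size * features).reshape(batch_size, features)

# Swap the two axes; for a 2D array, .T is shorthand for np.transpose.
transposed = data.T

print(data.shape)        # (100, 4)
print(transposed.shape)  # (4, 100)

# Same values, different arrangement: element [i, j] is now at [j, i].
assert data[7, 2] == transposed[2, 7]
```

Transposing a second time gives back the original arrangement, which is why the two forms carry exactly the same information.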

Time major vs Batch major

When it comes to RNNs, the tensors usually have rank 3 or higher, but the idea stays the same. If the input is (batch_size, sequence_num, features), it's called batch major, because the 0-axis is batch_size. Likewise, if the input is (sequence_num, batch_size, features), it's called time major, because the 0-axis is the time (sequence) dimension. The features dimension is always last (at least I don't know of real cases where it's not), so there's no further variety in naming.
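As a rough illustration of the rank-3 case, the sketch below converts a batch-major array to time major by swapping axes 0 and 1 and leaving the features axis in place. It uses NumPy with made-up sizes; in TensorFlow the corresponding call would be tf.transpose with perm=[1, 0, 2].

```python
import numpy as np

# Made-up sizes for illustration: 2 sequences, 3 time steps, 4 features.
batch_size, sequence_num, features = 2, 3, 4
batch_major = np.arange(batch_size * sequence_num * features).reshape(
    batch_size, sequence_num, features
)

# Swap axes 0 and 1; the features axis (2) stays last.
time_major = np.transpose(batch_major, (1, 0, 2))

print(batch_major.shape)  # (2, 3, 4)
print(time_major.shape)   # (3, 2, 4)

# The feature vector of sequence b at time step t is the same in both layouts.
b, t = 1, 2
assert np.array_equal(batch_major[b, t], time_major[t, b])
```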

Depending on the network structure, it might expect specifically the batch or the time as the 0-axis, so the format of the input data matters. And depending on the previous layers, either of those representations may be what gets fed into an RNN. So a conversion from one arrangement to the other might be required, either by the library function or by the caller. As far as I can remember, batch major is the default in TensorFlow and Keras, so it simply boils down to what shape is produced by the layer just before the RNN.
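The following toy example sketches how a function can accept either layout and convert internally. Everything in it is hypothetical: run_simple_rnn, its weights, and its time_major flag are invented for illustration and are not any library's actual implementation. The loop also hints at why time major can be convenient: iterating over the 0-axis yields one time step for the whole batch.

```python
import numpy as np

def run_simple_rnn(inputs, w_x, w_h, time_major=False):
    """Hypothetical toy RNN: h_t = tanh(x_t @ w_x + h_{t-1} @ w_h)."""
    if not time_major:
        # Caller passed (batch, time, features); rearrange to (time, batch, features).
        inputs = np.transpose(inputs, (1, 0, 2))
    batch_size = inputs.shape[1]
    h = np.zeros((batch_size, w_h.shape[0]))
    for x_t in inputs:  # each x_t has shape (batch_size, features)
        h = np.tanh(x_t @ w_x + h @ w_h)
    return h  # final hidden state, shape (batch_size, hidden_size)

rng = np.random.default_rng(0)
x_batch_major = rng.normal(size=(2, 5, 3))  # (batch, time, features)
w_x = rng.normal(size=(3, 4))               # features -> hidden
w_h = rng.normal(size=(4, 4))               # hidden -> hidden

# Feed the same data in both layouts; the result should not depend on the layout.
out_a = run_simple_rnn(x_batch_major, w_x, w_h)
x_time_major = np.transpose(x_batch_major, (1, 0, 2))
out_b = run_simple_rnn(x_time_major, w_x, w_h, time_major=True)
assert np.allclose(out_a, out_b)
```

Whether the conversion happens inside the function or in the caller is a design choice; what matters is that the layout passed in matches the layout the function assumes.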

Once again: there is a one-to-one correspondence between the batch major and time major representations. Any tensor can be represented as either. But a particular implementation may expect or require one of them.

Maxim answered Oct 22 '22 17:10
