Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spring batch difference between Multithreading vs partitioning

I cannot understand the difference between multi-threading and partitioning in Spring batch. The implementation is of course different: In partitioning you need to prepare the partitions then process it. I want to know what is the difference and which one is more efficient way to process when the bottleneck is the item-processor.

like image 519
mettok Avatar asked Jun 11 '15 16:06

mettok


People also ask

What is Spring Batch partitioning?

In Spring Batch, “Partitioning” is “multiple threads to process a range of data each”. For example, assume you have 100 records in a table, which has “primary id” assigned from 1 to 100, and you want to process the entire 100 records. Normally, the process starts from 1 to 100, a single thread example.

How does multithreading work in Spring Batch?

Multithreaded steps. By default, Spring Batch uses the same thread to execute a batch job from start to finish, meaning that everything runs sequentially. Spring Batch also allows multithreading at the step level. This makes it possible to process chunks using several threads.

How can I improve my spring batch performance?

read an ASCII file with fixed column length values sent by a third-party with previously specified layout (STEP 1 reader) validate the read values and register (Log file) the errors (custom messages) Apply some business logic on the processor to filter any undesirable lines (STEP 1 processor)

Does spring batch run in parallel?

Spring Batch Parallel Processing is each chunk in its own thread by adding a task executor to the step. If there are a million records to process and each chunk is 1000 records, and the task executor exposes four threads, you can handle 4000 records in parallel instead of 1000 records.


1 Answers

TL;DR;
Neither approach is intended to help when the bottleneck is in the processor. You will see some gains by having multiple items going through a processor at the same time, but both of the options you point out get their full benefits when used in processes that are I/O bound. The AsyncItemProcessor/AsyncItemWriter may be a better option.

Overview of Spring Batch Scalability
There are five options for scaling Spring Batch jobs:

  1. Multithreaded step
  2. Parallel steps
  3. Partitioning
  4. Remote chunking
  5. AsyncItemProcessor/AsyncItemWriter

Each has it's own benefits and disadvantages. Let's walk through each:

Multithreaded step
A multithreaded step takes a single step and executes each chunk within that step on a separate thread. This means that the same instances of each of the batch components (readers, writers, etc) are shared across the threads. This can increase performance by adding some parallelism to the step at the cost of restartability in most cases. You sacrifice restartability because in most cases, the ability to restart is based on the state maintained within the reader/writer/etc. With multiple threads updating that state, it becomes invalid and useless for restart. Because of this, you typically need to turn save state off on individual components and set the restartable flag to false on the job.

Parallel steps
Parallel steps are achieved via a split. It allows you to execute multiple, independent steps in parallel via threads. This does not sacrifice restartability, but does not help improve the performance of a single step or piece of business logic.

Partitioning
Partitioning is the dividing of data, in advance, into smaller chunks (called partitions) by a master step and then having slaves work independently on the partitions. In Spring Batch, both the master and each slave, is an independent step so you can get the benefits of parallelism within a single step without sacrificing restartability. Partitioning also provides the ability to scale beyond a single JVM in that the slaves do not have to be local (you can use various communication mechanisms to communicate with remote slaves).

An important note about partitioning is that the only communication between the master and slave is a description of the data and not the data itself. For example, the master may tell slave1 to process records 1-100, slave2 to process records 101-200, etc. The master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. Because of this, the data must be local to the slave processes and the master can be located anywhere.

Remote chunking
Remote chunking allows you to scale the process and optionally the write logic across JVMs. In this use case, the master reads the data and then sends it over the wire to the slaves where it is processed and then either written locally to the slave or returned to the master for writing local to the master.

The important difference between partitioning and remote chunking is that instead of a description going over the wire, remote chunking sends the actual data over the wire. So instead of a single packet saying process records 1-100, remote chunking is going to send the actual records 1-100. This can have a large impact on the I/O profile of a step, but if the processor is enough of a bottleneck, this can be useful.

AsyncItemProcessor/AsyncItemWriter
The final option for scaling Spring Batch processes is the AsyncItemProcessor/AsycnItemWriter combination. In this case, the AsyncItemProcessor wraps your ItemProcessor implementation and executes the call to your implementation in a separate thread. The AsyncItemProcessor then returns a Future that is passed to the AsyncItemWriter where it is unwrapped and passed to the delegate ItemWriter implementation.

Because of the nature of how data flows through this option, certain listener scenarios are not supported (since we don't know the outcome of the ItemProcessor call until inside the ItemWriter) but overall, it can provide a useful tool for parallelizing just the ItemProcessor logic in a single JVM without sacrificing restartability.

like image 103
Michael Minella Avatar answered Sep 24 '22 18:09

Michael Minella