
Spring Batch - Reading a large flat file - Choices to scale horizontally?

I started researching Spring Batch an hour or two ago and would appreciate your input.

The problem: read one or more CSV files containing 20 million records, perform minor processing, store the results in a database, and also write the output to another flat file, all in the least time possible.

Most important: I need to make choices that will scale horizontally in the future.

Questions:

Should I use remote chunking or partitioning to scale horizontally?

Since the data is in a flat file, are both remote chunking and partitioning bad choices?

Which multi-process solution makes it possible to read from a large file, spread processing across multiple servers, and update the database, but finally write the output to a single file?

Does MultiResourcePartitioner work across servers?

Do you know of any good tutorials where something like this has been accomplished or demonstrated?

What are your thoughts on how this should be attempted? For example: 1) split the large file into smaller files before starting the job, 2) read one file at a time using the ItemReader, and so on.

user3888680 asked Jul 29 '14 17:07


1 Answer

Assuming the "minor processing" isn't the bottleneck, the best option for scaling this type of job is partitioning. The job would have two steps. The first step splits the large file into smaller files. To do this, I'd recommend using the SystemCommandTasklet to shell out to the OS to split the file (this typically performs better than streaming the entire file through the JVM). An example of doing that would look something like this:

<bean id="fileSplittingTasklet" class="org.springframework.batch.core.step.tasklet.SystemCommandTasklet" scope="step">
    <property name="command" value="split -a 5 -l 10000 #{jobParameters['inputFile']} #{jobParameters['stagingDirectory']}"/>
    <property name="timeout" value="60000"/>
    <property name="workingDirectory" value="/tmp/input_temp"/>
</bean>

The second step would be a partitioned step. If the split files are in a location that is not shared, you'd use local partitioning. However, if the resulting files are on a network share somewhere, you can use remote partitioning. In either case, you'd use the MultiResourcePartitioner to generate one StepExecution per file. These would then be executed by the slaves (either running locally on threads or running remotely and listening to some messaging middleware).
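As a rough sketch of the local-partitioning variant, the configuration below wires the two steps together. It is an illustration under assumptions, not the configuration from the linked example: the bean ids (fileImportJob, filePartitioner, fileReader, and so on) and the commit interval are made up, and itemProcessor, itemWriter, and lineMapper are placeholders for beans you would define yourself. It also assumes the split files are matched by the pattern file:<stagingDirectory>*, in line with how the split command above uses the stagingDirectory job parameter as the output prefix.

```xml
<!-- Sketch only: bean ids, the commit interval, and the resource pattern are assumptions. -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:batch="http://www.springframework.org/schema/batch"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/batch
           http://www.springframework.org/schema/batch/spring-batch.xsd">

    <batch:job id="fileImportJob">
        <!-- Step 1: shell out to the OS to split the input file -->
        <batch:step id="splitFileStep" next="partitionedStep">
            <batch:tasklet ref="fileSplittingTasklet"/>
        </batch:step>
        <!-- Step 2: the partitioner creates one StepExecution per split file -->
        <batch:step id="partitionedStep">
            <batch:partition step="processFileStep" partitioner="filePartitioner">
                <!-- Local partitioning: each partition runs on a thread from this executor -->
                <batch:handler task-executor="taskExecutor"/>
            </batch:partition>
        </batch:step>
    </batch:job>

    <!-- The step each partition executes against its own file -->
    <batch:step id="processFileStep">
        <batch:tasklet>
            <batch:chunk reader="fileReader" processor="itemProcessor"
                         writer="itemWriter" commit-interval="1000"/>
        </batch:tasklet>
    </batch:step>

    <!-- Step scope delays resolving the pattern until step 1 has produced the files -->
    <bean id="filePartitioner"
          class="org.springframework.batch.core.partition.support.MultiResourcePartitioner"
          scope="step">
        <property name="resources" value="file:#{jobParameters['stagingDirectory']}*"/>
    </bean>

    <bean id="taskExecutor" class="org.springframework.core.task.SimpleAsyncTaskExecutor"/>

    <!-- MultiResourcePartitioner stores each file's URL under the key "fileName" by default -->
    <bean id="fileReader" class="org.springframework.batch.item.file.FlatFileItemReader" scope="step">
        <property name="resource" value="#{stepExecutionContext['fileName']}"/>
        <property name="lineMapper" ref="lineMapper"/>
    </bean>
</beans>
```

For remote partitioning, the job and partitioner definitions would stay largely the same; the handler would be replaced by one that sends the StepExecution requests over messaging middleware instead of running them on local threads.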

One thing to note about this approach is that the order in which records appear in the original file will not be maintained during processing.

You can see a complete remote partitioning example here: https://github.com/mminella/Spring-Batch-Talk-2.0 and a video of the talk/demo can be found here: https://www.youtube.com/watch?v=CYTj5YT7CZU

Michael Minella answered Nov 16 '22 02:11