Use case: Read 10 million rows [10 columns] from database and write to a file (csv format).

1. Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason?
2. Which would be better performing (fast) in the above use case?
3. Would the selection be different in case of a single-process vs multi-process approach?
4. In case of a multi-threaded approach using TaskExecutor, which one would be better & simple?


Spring Batch: Which ItemReader implementation to use for high volume & low latency

2 Answers

To process that kind of data, you're probably going to want to parallelize it if that is possible (the only thing preventing it would be if the output file needed to retain an order from the input). Assuming you are going to parallelize your processing, you are then left with two main options for this type of use case (from what you have provided):

1. Multithreaded step - This processes a chunk per thread until the input is exhausted. It is a very easy way to parallelize (you simply add a TaskExecutor to your step definition). With this, you do lose restartability out of the box, because you need to turn off state persistence on either of the ItemReaders you have mentioned (there are ways around this, such as flagging records in the database as having been processed).
2. Partitioning - This breaks up your input data into partitions that are processed by step instances in parallel (master/slave configuration). The partitions can be executed locally via threads (via a TaskExecutor) or remotely via remote partitioning. In either case, you gain restartability (each step processes its own data, so there is no stepping on state from partition to partition) along with parallelization.
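
To make the "chunk per thread" model of option 1 a bit more concrete, here is a minimal plain-Java sketch. It is not Spring Batch code and not how the framework is implemented internally; the class and method names are made up for illustration. Worker threads pull fixed-size chunks from one shared source, which is why access to the source has to be synchronized and why the order in which chunks finish is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

// Plain-Java model of "one chunk per thread"; hypothetical names, NOT Spring Batch code.
class ChunkPerThreadSketch {

    // Reads up to chunkSize items; synchronized because all workers share the source.
    static <T> List<T> readChunk(Iterator<T> source, int chunkSize) {
        List<T> chunk = new ArrayList<>(chunkSize);
        synchronized (source) {
            while (chunk.size() < chunkSize && source.hasNext()) {
                chunk.add(source.next());
            }
        }
        return chunk;
    }

    // Each worker keeps taking chunks and "writes" them (here: adds them to a total).
    static long processAll(Iterator<Integer> source, int chunkSize, int threads) {
        AtomicLong total = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                List<Integer> chunk = readChunk(source, chunkSize);
                while (!chunk.isEmpty()) {
                    long sum = 0;
                    for (int v : chunk) {
                        sum += v;
                    }
                    total.addAndGet(sum);
                    chunk = readChunk(source, chunkSize);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return total.get();
    }

    public static void main(String[] args) {
        Iterator<Integer> rows = IntStream.rangeClosed(1, 1000).iterator();
        System.out.println(processAll(rows, 100, 4)); // sum of 1..1000
    }
}
```

Every item is handled exactly once, but nothing records which chunks are already done, which mirrors why restartability is lost in this model unless you track progress yourself.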

I did a talk on processing data in parallel with Spring Batch. Specifically, the example I present is a remote partitioned job. You can view it here: https://www.youtube.com/watch?v=CYTj5YT7CZU

To your specific questions:

1. Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason? - Either of these two options can be tuned to meet many performance needs. It really depends on the database you're using, the driver options available, and the processing models you can support. Another consideration is whether you need restartability.
2. Which would be better performing (fast) in the above use case? - Again, it depends on the processing model you choose.
3. Would the selection be different in case of a single-process vs multi-process approach? - This comes down to how you manage jobs more than to what Spring Batch can handle. The question is whether you want to manage partitioning outside the job (passing the data description to the job as parameters) or have the job manage it (via partitioning).
4. In case of a multi-threaded approach using TaskExecutor, which one would be better & simple? - I won't deny that remote partitioning adds a level of complexity that local partitioning and multithreaded steps don't have.

I'd start with the basic step definition. Then try a multithreaded step. If that doesn't meet your needs, then move to local partitioning, and finally remote partitioning if needed. Keep in mind that Spring Batch was designed to make that progression as painless as possible. You can go from a regular step to a multithreaded step with only configuration updates. To go to partitioning, you need to add a single new class (a Partitioner implementation) and some configuration updates.
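
As a rough idea of what that single new class has to compute, here is a minimal sketch of splitting a key range into partitions. It is plain Java so that it stands alone; in a real job this logic would sit inside a Partitioner implementation, whose partition(int gridSize) method returns one ExecutionContext per partition. The assumption that the table has a numeric, reasonably dense key, and the class and method names, are mine and not from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the range arithmetic a Partitioner implementation might use.
class IdRangePartitionSketch {

    // Splits the inclusive range [minId, maxId] into at most gridSize
    // contiguous, non-overlapping {fromId, toId} pairs.
    static List<long[]> split(long minId, long maxId, int gridSize) {
        if (gridSize <= 0) {
            throw new IllegalArgumentException("gridSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        for (long from = minId; from <= maxId; from += size) {
            long to = Math.min(from + size - 1, maxId);
            ranges.add(new long[] {from, to});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Four partitions over ten million keys.
        for (long[] r : split(1, 10_000_000, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Each pair would then be handed to one step instance, for example as the bounds of a "WHERE id BETWEEN ? AND ?" clause in that partition's reader. If the keys are sparse or skewed, equal-width ranges will not give equal-sized partitions, so treat this as a starting point only.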

One final note: most of this answer has been about parallelizing the processing of this data, but Spring Batch's FlatFileItemWriter is not thread safe. If speed is your number one concern, your best bet would be to write to multiple files in parallel and then aggregate them afterwards.
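
The aggregation step can be very small. The following is a minimal sketch in plain Java, assuming each parallel writer produced its own headerless part file; the file names and the class are hypothetical, and in a job this would typically run as a final step after the parallel ones have completed.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: concatenates part files written in parallel into one CSV.
class PartFileMergeSketch {

    // Appends each part file to target, in the order given.
    static void merge(List<Path> parts, Path target) {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // streams the file, no full load into memory
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small demo: writes two part files to a temp directory and merges them.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("parts");
            Path p0 = dir.resolve("part-0.csv");
            Path p1 = dir.resolve("part-1.csv");
            Files.write(p0, "1,a\n2,b\n".getBytes(StandardCharsets.UTF_8));
            Files.write(p1, "3,c\n".getBytes(StandardCharsets.UTF_8));
            Path target = dir.resolve("export.csv");
            merge(Arrays.asList(p0, p1), target);
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

If each part file carries its own header line, the merge would need to skip it for every file after the first.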

189

answered Sep 23 '22 15:09

Michael Minella

You should profile this in order to make a choice. In plain JDBC I would start with something that:

- prepares statements with ResultSet.TYPE_FORWARD_ONLY and ResultSet.CONCUR_READ_ONLY. Several JDBC drivers "simulate" cursors on the client side unless you use those two, and for large result sets you don't want that, as it will probably lead to an OutOfMemoryError because your JDBC driver is buffering the entire data set in memory. By using those options you increase the chance that you get server-side cursors and get the results "streamed" to you bit by bit, which is what you want for large result sets. Note that some JDBC drivers always "simulate" cursors on the client side, so this tip might be useless for your particular DBMS.
- sets a reasonable fetch size to minimize the impact of network roundtrips. 50-100 is often a good starting value for profiling. As the fetch size is only a hint, this might also be useless for your particular DBMS.
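
In plain JDBC, the two points above could look roughly like the sketch below. The class name, the CSV quoting helper, and the idea of passing the SQL and fetch size as parameters are my own assumptions for illustration; only the JDBC calls themselves are standard API.

```java
import java.io.IOException;
import java.io.Writer;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper: streams a query result to CSV with a forward-only, read-only cursor.
class ForwardOnlyExportSketch {

    // Quotes a field when it contains a comma, a double quote, or a line break.
    static String csvField(Object value) {
        if (value == null) {
            return "";
        }
        String s = value.toString();
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    // Writes every row of the query to out as one CSV line; returns the row count.
    static long export(Connection con, String sql, int fetchSize, Writer out)
            throws SQLException, IOException {
        try (PreparedStatement ps = con.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            ps.setFetchSize(fetchSize); // only a hint; the driver may ignore it
            try (ResultSet rs = ps.executeQuery()) {
                int columns = rs.getMetaData().getColumnCount();
                long rows = 0;
                while (rs.next()) {
                    StringBuilder line = new StringBuilder();
                    for (int c = 1; c <= columns; c++) {
                        if (c > 1) {
                            line.append(',');
                        }
                        line.append(csvField(rs.getObject(c)));
                    }
                    out.write(line.append('\n').toString());
                    rows++;
                }
                return rows;
            }
        }
    }
}
```

Whether this actually streams depends on the driver. Some drivers need extra settings before they use a server-side cursor (auto-commit turned off, or a special fetch size value), so check your driver's documentation and profile with your real data.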

JdbcCursorItemReader seems to cover both things, but as noted above, they are not guaranteed to give you the best performance on every DBMS, so I would start with that and then, if performance is inadequate, try JdbcPagingItemReader.

I don't think simple processing with JdbcCursorItemReader will be slow for your data set size unless you have very strict performance requirements. If you really need to parallelize, using JdbcPagingItemReader might be easier, but the interfaces of the two are very similar, so I would not count on it.

Anyway, profile.

answered Sep 24 '22 15:09

gpeche
