
Configuring gridSize in Spring Batch partitioning

In Spring Batch partitioning, the relationship between the gridSize of the PartitionHandler and the number of ExecutionContexts returned by the Partitioner is a little confusing. For example, MultiResourcePartitioner states that it ignores gridSize, but the Partitioner documentation doesn't explain when/why this is acceptable to do.

For example, let's say I have a taskExecutor that I want to re-use across different parallel steps, and that I set its size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5, and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will the parallelism actually behave?

Let's say the MultiResourcePartitioner returns 10 partitions for a particular run. Does this mean that only 5 of them will execute at a time until all 10 have completed, and that no more than 5 of the 20 threads will be used for this step?

If this is the case, when/why is it okay to ignore the 'gridSize' parameter when overriding Partitioner with a custom implementation? I think it would help if this were described in the documentation.

If this isn't the case, how can I achieve this? That is, how can I re-use a task executor and separately define the number of partitions that can run in parallel for that step and the number of partitions that actually get created?

Jared Gommels asked Jul 14 '15 19:07


1 Answer

There are a few good questions here, so let's walk through them individually:

For example, let's say I have a taskExecutor that I want to re-use across different parallel steps, and that I set its size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5, and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will the parallelism actually behave?

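The TaskExecutorPartitionHandler defers the concurrency limits to the TaskExecutor you provide. It hands every partition the Partitioner returns to that executor and does not use gridSize to throttle execution. Because of this, in your example, the PartitionHandler will use as many of the 20 threads as there are partitions and the TaskExecutor has free, so all 10 partitions can run at once.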
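To make that behavior concrete, here is a minimal sketch using only java.util.concurrent. It is a model of the idea, not Spring Batch code: the class `SharedPoolDemo` and the method `peakConcurrency` are hypothetical names, and a fixed thread pool stands in for the TaskExecutor. Every partition is submitted at once, so the only cap on parallelism is the pool size.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model: a fixed pool stands in for the TaskExecutor.
class SharedPoolDemo {

    /** Submits all partitions at once and returns the peak number that ran together. */
    static int peakConcurrency(int poolSize, int partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        // The first wave of tasks waits until the pool is as busy as it can get.
        CountDownLatch firstWave = new CountDownLatch(Math.min(poolSize, partitions));

        for (int i = 0; i < partitions; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                firstWave.countDown();
                try {
                    firstWave.await(5, TimeUnit.SECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for partitions", e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 10 partitions on a 20-thread pool: nothing holds them back.
        System.out.println(peakConcurrency(20, 10)); // 10
        // The same 10 partitions on a 5-thread pool: the pool is the limit.
        System.out.println(peakConcurrency(5, 10));  // 5
    }
}
```

Note that no grid size appears anywhere in the submission loop; in this model, as in the description above, the executor alone decides how many partitions run together.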

If this is the case, when/why is it okay to ignore the 'gridSize' parameter when overriding Partitioner with a custom implementation? I think it would help if this were described in the documentation.

When we look at a partitioned step, there are two components of concern: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data to be divided up and how best to do so. The PartitionHandler is responsible for delegating that work out to workers for execution. In order for the PartitionHandler to do its delegation, it needs to understand the "fabric" that it's working with (local threads, remote worker processes, etc.).

When dividing up the data to be worked on (via the Partitioner), it can be useful to know how many workers are available, and that is what gridSize communicates. However, that number isn't always useful; it depends on the data you're working with. For example, when dividing database rows, it makes sense to split them evenly across the number of workers available. With files, however, it's impractical in most scenarios to combine or split them, so it's easier to create one partition per file. Whether gridSize is useful or not depends on the data you're trying to divide up, which is why a Partitioner is free to ignore it.
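The two cases can be sketched side by side. This is a plain-Java illustration, not Spring Batch code: in Spring Batch the method would be `Partitioner.partition(int gridSize)` returning a `Map<String, ExecutionContext>`, and here a plain `Map<String, Object>` stands in for each ExecutionContext. The class name `PartitionSketch`, the method names, and the keys `minId`, `maxId`, and `fileName` are assumptions chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for two Partitioner implementations.
class PartitionSketch {

    /** Row-based split: uses gridSize (must be > 0) to cut [minId, maxId] into even slices. */
    static Map<String, Map<String, Object>> byRowRange(long minId, long maxId, int gridSize) {
        Map<String, Map<String, Object>> partitions = new LinkedHashMap<>();
        long total = maxId - minId + 1;
        long sliceSize = (total + gridSize - 1) / gridSize; // ceiling division
        long start = minId;
        for (int i = 0; start <= maxId; i++) {
            long end = Math.min(start + sliceSize - 1, maxId);
            Map<String, Object> context = new LinkedHashMap<>();
            context.put("minId", start);
            context.put("maxId", end);
            partitions.put("partition" + i, context);
            start = end + 1;
        }
        return partitions;
    }

    /** File-based split: one partition per file; gridSize is accepted but ignored. */
    static Map<String, Map<String, Object>> byFile(List<String> files, int gridSize) {
        Map<String, Map<String, Object>> partitions = new LinkedHashMap<>();
        int i = 0;
        for (String file : files) {
            Map<String, Object> context = new LinkedHashMap<>();
            context.put("fileName", file);
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}
```

With ids 1 through 100 and a gridSize of 5, `byRowRange` yields five slices of 20 rows each, while `byFile` given ten files yields ten partitions regardless of the gridSize passed in.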

If this isn't the case, how can I achieve this? That is, how can I re-use a task executor and separately define the number of partitions that can run in parallel for that step and the number of partitions that actually get created?

If you're re-using a TaskExecutor, you may not be able to, since that TaskExecutor may be doing other things at the same time. I wonder why you'd re-use one, given the relatively low overhead of creating a dedicated one (you can even make it step scoped so it's only created when the partitioned step is running).
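As a rough sketch of the dedicated-executor approach, the Java configuration below gives the partitioned step its own pool of 5 threads. It assumes Spring Batch and Spring's ThreadPoolTaskExecutor are on the classpath; the class name `FilePartitionConfig`, the bean names, and the injected `workerStep` are hypothetical placeholders for your own configuration. Because the pool has 5 threads and the executor's default queue is unbounded, extra partitions wait in the queue until a thread frees up.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.partition.PartitionHandler;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

// Hypothetical configuration; names are placeholders, not from the question.
@Configuration
public class FilePartitionConfig {

    // Dedicated pool: at most 5 partitions of this step run at the same time.
    @Bean
    public ThreadPoolTaskExecutor filePartitionExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(5);
        executor.setMaxPoolSize(5);
        executor.setThreadNamePrefix("file-partition-");
        return executor;
    }

    // workerStep is assumed to be the step that processes a single partition.
    @Bean
    public PartitionHandler filePartitionHandler(Step workerStep) {
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setTaskExecutor(filePartitionExecutor());
        handler.setStep(workerStep);
        // Passed to the Partitioner as a hint; a per-file Partitioner may ignore it.
        handler.setGridSize(5);
        return handler;
    }
}
```

With this arrangement the pool size controls how many partitions run in parallel, and the Partitioner alone controls how many partitions get created.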

Michael Minella answered Oct 22 '22 04:10