In Spring Batch partitioning, the relationship between the gridSize of the PartitionHandler and the number of ExecutionContexts returned by the Partitioner is a little confusing. For example, MultiResourcePartitioner states that it ignores gridSize, but the Partitioner documentation doesn't explain when or why it is acceptable to do so.
For example, let's say I have a taskExecutor that I want to re-use across different parallel steps, and that I set its size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5, and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will the parallelism actually behave?
Let's say the MultiResourcePartitioner returns 10 partitions for a particular run. Does this mean that only 5 of them will execute at a time until all 10 have completed, and that no more than 5 of the 20 threads will be used for this step?
If this is the case, when/why is it okay to ignore the 'gridSize' parameter when overriding Partitioner with a custom implementation? I think it would help if this was described in the documentation.
If this isn't the case, how can I achieve this? That is, how can I re-use a task executor and separately define the number of partitions that can run in parallel for that step and the number of partitions that actually get created?
There are a few good questions here so let's walk through them individually:
For example, let's say I have a taskExecutor that I want to re-use across different parallel steps, and that I set its size to 20. If I use a TaskExecutorPartitionHandler with a grid size of 5, and a MultiResourcePartitioner that returns an arbitrary number of partitions (one per file), how will the parallelism actually behave?
The TaskExecutorPartitionHandler defers the concurrency limitations to the TaskExecutor you provide. Because of this, in your example, the PartitionHandler will use up to all 20 threads, as the TaskExecutor allows.
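To illustrate, here is a minimal sketch of how the pieces from your example fit together. It assumes Spring Batch 4's StepBuilderFactory style; the bean names, the file pattern, and the workerStep bean (defined elsewhere) are placeholders, not anything from your configuration:

```java
import java.io.IOException;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.PartitionHandler;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class PartitionedStepConfig {

    // The shared 20-thread executor from the question.
    @Bean
    public ThreadPoolTaskExecutor sharedTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(20);
        return executor;
    }

    // One ExecutionContext per matching file; this Partitioner ignores gridSize.
    @Bean
    public MultiResourcePartitioner filePartitioner() throws IOException {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(
                new PathMatchingResourcePatternResolver().getResources("file:/data/in/*.csv"));
        return partitioner;
    }

    @Bean
    public PartitionHandler partitionHandler(Step workerStep,
                                             ThreadPoolTaskExecutor sharedTaskExecutor) {
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setStep(workerStep);
        handler.setTaskExecutor(sharedTaskExecutor);
        // gridSize is only a hint passed to the Partitioner; it does not cap concurrency.
        handler.setGridSize(5);
        return handler;
    }

    @Bean
    public Step masterStep(StepBuilderFactory stepBuilderFactory,
                           MultiResourcePartitioner filePartitioner,
                           PartitionHandler partitionHandler) {
        // If the partitioner returns 10 partitions, all 10 worker StepExecutions are
        // handed to the TaskExecutor; its 20-thread pool is what bounds concurrency.
        return stepBuilderFactory.get("masterStep")
                .partitioner("workerStep", filePartitioner)
                .partitionHandler(partitionHandler)
                .build();
    }
}
```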
If this is the case, when/why is it okay to ignore the 'gridSize' parameter when overriding Partitioner with a custom implementation? I think it would help if this was described in the documentation.
When we look at a partitioned step, there are two components of concern: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data to be divided up and how best to do so. The PartitionHandler is responsible for delegating that work out to slaves for execution. In order for the PartitionHandler to do its delegation, it needs to understand the "fabric" that it's working with (local threads, remote slave processes, etc.).
When dividing up the data to be worked on (via the Partitioner), it can be useful to know how many workers are available. However, that metric isn't always useful for the data you're working with. For example, when dividing database rows, it makes sense to divide them evenly across the number of workers available. However, in most scenarios it's impractical to combine or split files, so it's easier to just create one partition per file. Whether gridSize is useful depends on the data you're trying to divide up, as shown in the sketch below.
If this isn't the case, how can I achieve this? That is, how can I re-use a task executor and separately define the number of partitions that can run in parallel for that step and the number of partitions that actually get created?
If you're re-using a TaskExecutor, you may not be able to control that, since the TaskExecutor may be doing other things. I wonder why you'd re-use one, given the relatively low overhead of creating a dedicated one (you can even make it step scoped so it's only created when the partitioned step is running).
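As a rough sketch of that dedicated-executor approach (pool sizes, names, and the surrounding class are my assumptions, not part of your setup), you could give the partitioned step its own bounded pool and wire it into the TaskExecutorPartitionHandler in place of the shared executor shown earlier:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class DedicatedPartitionExecutorConfig {

    // Dedicated pool for the partitioned step only. The Partitioner can still create
    // one partition per file (however many that turns out to be); this pool caps how
    // many of those partitions execute at the same time.
    @Bean
    public TaskExecutor partitionedStepExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(5);                  // at most 5 partitions in flight
        executor.setMaxPoolSize(5);
        executor.setQueueCapacity(Integer.MAX_VALUE); // remaining partitions wait in the queue
        executor.setThreadNamePrefix("partition-");
        return executor;
    }
}
```

With core and max pool sizes both at 5 and an unbounded queue, any additional partitions simply wait for a free thread, so the number of partitions created and the number running in parallel are controlled independently.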