
Spring Batch and Pivotal Cloud Foundry [closed]

Tags:

spring-batch

We are evaluating the Spring Batch framework to replace our home-grown batch framework, and we need to be able to deploy batch jobs to Pivotal Cloud Foundry (PCF). With that in mind, can you share your thoughts on the questions below?

  • Suppose we use the Remote Partitioning strategy to process a large volume of records. Can the batch job auto-scale Slave nodes in the cloud based on the amount of data that the batch job processes? Or do we have to scale out the appropriate number of Slave nodes and keep them in place before the batch job kicks off?
  • How does the "grid size" parameter configuration work in the scenario above?
Murali asked Feb 19 '16 16:02



1 Answer

You have a few questions here. However, before getting into them, let me take a minute and walk through where batch processing is on PCF right now and then get to your questions.

Current state of CF

As of PCF 1.6, Diego (the dynamic runtime within CF) provides a new primitive called Tasks. Traditionally, all applications running on CF were expected to be long-running processes. Because of this, to run a batch job on CF you needed to package it as a long-running process (usually a web app) and deploy that. If you wanted to use remote partitioning, you needed to deploy and scale the slaves as you saw fit, but that was all external to CF. With Tasks, Diego now supports short-lived processes, that is, processes that won't be restarted when they complete. This means you can run a batch job as a Spring Boot über jar and, once it completes, CF won't try to restart it (that's a good thing). The issue with 1.6 is that no API exposing Tasks was available, so Tasks were only an internal construct.

With PCF 1.7, a new API is being released to expose Tasks for general use. As part of the v3 API, you'll be able to deploy your own apps as Tasks. This allows you to launch a batch job as a task knowing it will execute, then be cleaned up by PCF. With that in mind...

Can the batch job auto-scale Slave nodes in the cloud based on the amount of data that the batch job processes?

When using Spring Batch's partitioning capabilities, there are two key components: the Partitioner and the PartitionHandler. The Partitioner is responsible for understanding the data and how it can be divided up. The PartitionHandler is responsible for understanding the fabric over which the partitions are distributed to the slaves.

For Spring Cloud Data Flow, we plan on creating a PartitionHandler implementation that will allow users to execute slave partitions as Tasks on CF. Essentially, what we'd expect is that the PartitionHandler would launch the slaves as tasks and once they are complete, they would be cleaned up.

This approach allows slaves to be launched dynamically, with their number driven by the number of partitions (configurable up to a max).
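To make the "one worker per partition, capped at a max" idea concrete, here is a minimal, dependency-free sketch. It is not the planned CF PartitionHandler and it does not use the Spring Batch API: `BoundedWorkerLaunchSketch`, `handle`, and `maxWorkers` are hypothetical names, and a local thread pool stands in for launching short-lived tasks on CF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: a thread pool stands in for CF Tasks.
class BoundedWorkerLaunchSketch {

    /** Runs one worker per partition, with at most maxWorkers running at once. */
    static List<String> handle(List<String> partitionNames, int maxWorkers) {
        if (maxWorkers < 1) {
            throw new IllegalArgumentException("maxWorkers must be >= 1");
        }
        List<String> results = new ArrayList<>();
        if (partitionNames.isEmpty()) {
            return results;
        }
        // Never start more workers than there are partitions.
        int workers = Math.min(maxWorkers, partitionNames.size());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String name : partitionNames) {
                // Stand-in for "launch a short-lived task for this partition".
                Callable<String> worker = () -> name + ": COMPLETED";
                futures.add(pool.submit(worker));
            }
            for (Future<String> future : futures) {
                results.add(future.get()); // block until this worker finishes
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e);
        } finally {
            pool.shutdown(); // workers go away once the step is done
        }
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("partition0", "partition1", "partition2");
        handle(partitions, 2).forEach(System.out::println);
    }
}
```

With three partitions and a max of two, the third partition simply waits until one of the first two workers finishes; nothing has to be scaled up ahead of time.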

We plan on doing this work for Spring Cloud Data Flow but the PartitionHandler should be available for users outside of that workflow as well.

How does the "grid size" parameter configuration work in the scenario above?

The grid size parameter is really used by the Partitioner, not the PartitionHandler, and is intended as a hint about how many workers there may be. In this case, it could be used to configure how many partitions you want to create, but that really is up to the Partitioner implementation.
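As an illustration of grid size being a hint, here is a minimal sketch of a range-based partitioner. In Spring Batch the actual contract is `Partitioner.partition(int gridSize)`, which returns a `Map<String, ExecutionContext>`; to stay dependency-free, this sketch uses plain maps instead, and the class name, the `minId`/`maxId` keys, and the id-range splitting strategy are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, dependency-free stand-in for a Spring Batch Partitioner.
class RangePartitionSketch {

    /** Splits the inclusive id range [minId, maxId] into at most gridSize partitions. */
    static Map<String, Map<String, Long>> partition(long minId, long maxId, int gridSize) {
        if (gridSize < 1 || maxId < minId) {
            throw new IllegalArgumentException("gridSize must be >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        // gridSize is only a hint: never create more partitions than records.
        long partitions = Math.min(gridSize, total);
        long baseSize = total / partitions;
        long remainder = total % partitions;

        Map<String, Map<String, Long>> result = new LinkedHashMap<>();
        long start = minId;
        for (long i = 0; i < partitions; i++) {
            // The first `remainder` partitions take one extra record.
            long size = baseSize + (i < remainder ? 1 : 0);
            Map<String, Long> context = new LinkedHashMap<>();
            context.put("minId", start);
            context.put("maxId", start + size - 1);
            result.put("partition" + i, context);
            start += size;
        }
        return result;
    }

    public static void main(String[] args) {
        partition(1, 10, 4).forEach((name, ctx) -> System.out.println(name + " -> " + ctx));
    }
}
```

Here a grid size of 4 over ids 1 to 10 yields four ranges, while a grid size of 10 over only three records yields three partitions. A different Partitioner could just as legitimately ignore the hint and derive the partition count from the data alone.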

Conclusion

This is a description of what a batch workflow on CF would look like. It's important to note that PCF 1.7 is not out as of the writing of this answer. It is scheduled to be out in Q1 of 2016, and this functionality will follow shortly afterwards.

Michael Minella answered Sep 22 '22 17:09
