Spring Batch correctly restart uncompleted jobs in clustered environment

I use the following logic to restart uncompleted jobs in a single-node Spring Batch application:

public void restartUncompletedJobs() {

    try {
        jobRegistry.register(new ReferenceJobFactory(documetPipelineJob));

        List<String> jobs = jobExplorer.getJobNames();
        for (String job : jobs) {
            Set<JobExecution> runningJobs = jobExplorer.findRunningJobExecutions(job);

            for (JobExecution runningJob : runningJobs) {
                runningJob.setStatus(BatchStatus.FAILED);
                runningJob.setEndTime(new Date());
                jobRepository.update(runningJob);
                jobOperator.restart(runningJob.getId());
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
}

Now I'm trying to make it work on a two-node cluster. The application instances on both nodes point to a shared PostgreSQL database.

Consider the following example: I have two job instances. jobInstance1 is running on node1 and jobInstance2 is running on node2. Node1 is restarted for some reason while jobInstance1 is executing. After node1 restarts, the Spring Batch application tries to restart the uncompleted jobs using the logic above. It sees two uncompleted job instances, jobInstance1 and jobInstance2 (which is running correctly on node2), and tries to restart both of them. So instead of restarting only jobInstance1, it restarts both, even though jobInstance2 should not be restarted because it is executing correctly on node2.

How can I, during application startup, correctly restart the jobs that were not completed before the previous application termination, while preventing jobs like jobInstance2 from also being restarted?

UPDATED

This is the solution provided in the answer below:

  1. Get the job instances of your job with JobOperator#getJobInstances
  2. For each instance, check if there is a running execution using JobOperator#getExecutions.

    2.1 If there is a running execution, move to next instance (in order to let the execution finish either successfully or with a failure)

    2.2 If there is no currently running execution, check the status of the last execution and restart it if failed using JobOperator#restart.

I have a question regarding #2.1: will Spring Batch automatically restart uncompleted jobs that have a running execution after the application restarts, or do I need to do that manually?

alexanoid asked Oct 17 '22 15:10



1 Answer

Your logic does not restart uncompleted jobs. It takes currently running job executions, sets their status to FAILED, and restarts them. Instead of finding running executions, it should look for executions that are not currently running, especially failed ones, and restart those.

How to correctly restart the failed jobs and prevent the situation when the jobs like jobInstance2 will be also restarted?

In pseudocode, what you need to do is:

  1. Get the job instances of your job with JobOperator#getJobInstances
  2. For each instance, check if there is a running execution using JobOperator#getExecutions.

    2.1 If there is a running execution, move to next instance (in order to let the execution finish either successfully or with a failure)

    2.2 If there is no currently running execution, check the status of the last execution and restart it if failed using JobOperator#restart.

In your scenario:

  • jobInstance1 should be restarted in step 2.2
  • jobInstance2 should be filtered in step 2.1 since there is a running execution for it on node 2.
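
The selection logic in steps 2.1 and 2.2 can be sketched in plain Java as below. This is only an illustration, not Spring Batch code: RestartPlanner, Execution and Status are hypothetical stand-ins for the real JobExecution and BatchStatus types, and the sketch assumes each instance's executions are listed newest first and that "running" means the recorded status is STARTING, STARTED or STOPPING. In a real application the ids returned here would be the ones passed to JobOperator#restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for Spring Batch types, so the selection logic
// can be read and run without a job repository.
class RestartPlanner {

    enum Status { STARTING, STARTED, STOPPING, STOPPED, FAILED, COMPLETED }

    static final class Execution {
        final long id;
        final Status status;

        Execution(long id, Status status) {
            this.id = id;
            this.status = status;
        }

        // Assumption of this sketch: an execution counts as running while
        // its recorded status is still an in-progress one.
        boolean isRunning() {
            return status == Status.STARTING
                    || status == Status.STARTED
                    || status == Status.STOPPING;
        }
    }

    /**
     * @param executionsByInstance job instance id mapped to its executions,
     *                             assumed to be ordered newest first
     * @return ids of the executions that should be restarted
     */
    static List<Long> executionsToRestart(Map<Long, List<Execution>> executionsByInstance) {
        List<Long> toRestart = new ArrayList<>();
        for (List<Execution> executions : executionsByInstance.values()) {
            if (executions.isEmpty()) {
                continue; // instance was never executed, nothing to restart
            }
            // Step 2.1: skip the instance if any execution is still running.
            boolean anyRunning = false;
            for (Execution execution : executions) {
                if (execution.isRunning()) {
                    anyRunning = true;
                    break;
                }
            }
            if (anyRunning) {
                continue;
            }
            // Step 2.2: restart only if the last (newest) execution failed.
            Execution last = executions.get(0);
            if (last.status == Status.FAILED) {
                toRestart.add(last.id);
            }
        }
        return toRestart;
    }
}
```

With jobInstance1 holding a single FAILED execution and jobInstance2 holding a STARTED one, only the execution of jobInstance1 is returned. Note that the sketch takes the recorded statuses at face value: it presumes the last execution of jobInstance1 is already stored as FAILED.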
Mahmoud Ben Hassine answered Oct 21 '22 03:10
