 

How does Quartz detect node failures?

My production environment runs a Java scheduler job using Quartz 2.1.4 on a WebLogic cluster with 4 machines. For a few months, the single scheduled job executed normally on one cluster node (node 1). Last night, however, node 2 suddenly decided that node 1 had failed and took over the executing job. In fact, node 1 had no errors (according to the server, network, database, and application logs). Because two processes then executed concurrently, duplicate messages were created.

What mechanism does Quartz use to detect node failures? Is it a ping scan, a heartbeat via UDP broadcast, database response time, or something else? Is there any configuration for it?

I have read the Quartz configuration guide http://quartz-scheduler.org/documentation/quartz-2.1.x/configuration/ConfigJDBCJobStoreClustering , but it does not answer this.

I am using JDBCJobStore. After checking the details, we found that a database (Oracle) statement was taking abnormally long to execute (from 5 sec to 30 sec). The incident happened during this period. Do you think it is related?

My configuration is:

`org.quartz.threadPool.threadCount=10
org.quartz.threadPool.threadPriority=5
org.quartz.jobStore.misfireThreshold=10000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX`

Does anyone have this information? Thanks.

Calvin Lai asked Oct 19 '12 02:10



1 Answer

I know this answer is very late, but maybe somebody else with the same problem will still need it.

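Short version: it is all handled through the database. There is no ping or network heartbeat; each node periodically checks in by writing to a shared table. The important property is org.quartz.jobStore.clusterCheckinInterval.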
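For illustration, here is a minimal sketch of what a clustered JDBC job store configuration might look like when loaded in Java. The property names are standard Quartz settings, but the values are assumptions chosen for the example (in particular the 20000 ms check-in interval), and the data source properties that a real JDBC job store also needs are left out. The class name `ClusterConfigSketch` is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class ClusterConfigSketch {
    // Illustrative values only; data source settings are deliberately omitted.
    static final String CONFIG = String.join("\n",
            "org.quartz.scheduler.instanceId=AUTO",
            "org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX",
            "org.quartz.jobStore.isClustered=true",
            // How often (in ms) each node checks in with the database.
            "org.quartz.jobStore.clusterCheckinInterval=20000");

    // Parse the configuration text into a Properties object.
    static Properties load() throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(CONFIG));
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = load();
        System.out.println("check-in interval (ms): "
                + props.getProperty("org.quartz.jobStore.clusterCheckinInterval"));
    }
}
```

In a real application, a `Properties` object like this would typically be handed to `new StdSchedulerFactory(props)`, or the same keys would simply live in `quartz.properties`. A longer check-in interval makes the cluster slower to notice a genuinely dead node, but more tolerant of short stalls.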

Long version (all credits go to http://flylib.com/books/en/2.65.1.91/1/ ) :

Detecting Failed Scheduler Nodes

When a Scheduler instance performs the check-in routine, it looks to see if there are other Scheduler instances that didn't check in when they were supposed to. It does this by inspecting the SCHEDULER_STATE table and looking for schedulers that have a value in the LAST_CHECK_TIME column that is older than the property org.quartz.jobStore.clusterCheckinInterval (discussed in the next section). If one or more nodes haven't checked in, the running Scheduler assumes that the other instance(s) have failed.
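To make the rule above concrete, here is a simplified sketch of the comparison, not Quartz's actual code. The method name, the grace margin, and all the numbers are assumptions made for the example. It also shows why a slow database statement could matter in the question's case: if a node's check-in write is held up long enough, its last recorded check-in becomes stale and another node may presume it dead even though it is healthy.

```java
public class FailureCheckSketch {
    /**
     * Simplified rule: a node is presumed failed when its last recorded
     * check-in is older than the check-in interval plus a grace margin.
     * (The grace margin is a hypothetical parameter for this sketch.)
     */
    static boolean isPresumedFailed(long lastCheckinMillis,
                                    long checkinIntervalMillis,
                                    long graceMillis,
                                    long nowMillis) {
        return nowMillis - lastCheckinMillis > checkinIntervalMillis + graceMillis;
    }

    public static void main(String[] args) {
        long interval = 15_000L; // assumed clusterCheckinInterval in ms
        long grace = 5_000L;     // assumed tolerance in ms
        long now = 100_000L;     // the observing node's current time

        // Node checked in 10 s ago: still within interval + grace.
        System.out.println(isPresumedFailed(90_000L, interval, grace, now)); // false

        // Node's check-in was delayed (e.g. by a slow database statement),
        // so its last recorded check-in is 35 s old: presumed failed.
        System.out.println(isPresumedFailed(65_000L, interval, grace, now)); // true
    }
}
```

In an actual Quartz schema the table and column are usually named along the lines of QRTZ_SCHEDULER_STATE and LAST_CHECKIN_TIME, so it is worth checking your own schema rather than relying on the names in the quoted text.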

Additionally, the next paragraph might also be important:

Running Nodes on Separate Machines with Unsynchronized Clocks

As you can ascertain by now, if you run nodes on different machines and the clocks are not synchronized, you can get unexpected results. This is because a timestamp is being used to inform other instances of the last time one node checked in. If that node's clock was set for the future, a running Scheduler might never realize that a node has gone down. On the other hand, if a clock on one node is set in the past, a node might assume that the node has gone down and attempt to take over and rerun its jobs. In either case, it's not the behavior that you want. When you're using different machines in a cluster (which is the normal case), be sure to synchronize the clocks. See the section "Quartz Clustering Cookbook," later in this chapter for details on how to do this.
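The effect of clock skew can be sketched the same way. This is an illustrative model under stated assumptions, not Quartz code: the writing node stamps its check-in with its own clock, the observing node compares that stamp against its own clock, and the threshold and offsets below are made-up numbers.

```java
public class ClockSkewSketch {
    /**
     * The writer records its check-in using its own (possibly skewed) clock;
     * the observer compares the recorded value against its own clock.
     */
    static boolean looksFailed(long trueCheckinMillis,
                               long writerClockOffsetMillis,
                               long observerNowMillis,
                               long thresholdMillis) {
        long recorded = trueCheckinMillis + writerClockOffsetMillis;
        return observerNowMillis - recorded > thresholdMillis;
    }

    public static void main(String[] args) {
        long threshold = 20_000L; // assumed staleness threshold in ms
        long now = 100_000L;      // observer's clock

        // Healthy node checked in 5 s ago, but its clock runs 60 s behind:
        // the recorded stamp looks 65 s old, so it appears failed.
        System.out.println(looksFailed(95_000L, -60_000L, now, threshold)); // true

        // Dead node last checked in 90 s ago, but its clock ran 120 s ahead:
        // the recorded stamp is in the observer's future, so it appears alive.
        System.out.println(looksFailed(10_000L, 120_000L, now, threshold)); // false
    }
}
```

The first case matches the duplicate-execution symptom from the question, so it is worth verifying that all cluster machines synchronize their clocks (for example via NTP).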

Michał Cegielski answered Oct 02 '22 21:10
