
JVM crash because of lock on nfs file after network outage

The following code snippet causes a JVM crash if a network outage occurs after the lock has been acquired:

    while (true) {

        // file shared over NFS
        String filename = "/home/amit/mount/lock/aLock.txt";
        RandomAccessFile file = new RandomAccessFile(filename, "rws");
        System.out.println("file opened");
        FileLock fileLock = file.getChannel().tryLock();
        if (fileLock != null) {
            System.out.println("lock acquired");
        } else {
            System.out.println("lock not acquired");
        }

        try {
            // wait for 30 seconds
            Thread.sleep(30000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        if (fileLock != null) {
            System.out.println("closing filelock");
            fileLock.close();
        }
        System.out.println("closing file");
        file.close();
    }

Observation: the JVM receives a KILL (9) signal and exits with exit code 137 (128 + 9).

Probably something goes wrong in the file-descriptor tables after the network connection is re-established. This behavior is also reproducible with the system call flock(2) and the shell utility flock(1).

Any suggestions/workarounds?

PS: using Oracle JDK 1.7.0_25 with NFSv4

EDIT: This lock will be used to identify which process is active in a distributed high-availability cluster. The exit code is 137. What I expect: a way to detect the problem, close the file, and try to re-acquire the lock.

asked Sep 18 '13 by Amit G

3 Answers

After the NFS server reboots, all clients that hold active file locks start the lock reclamation procedure, which lasts no longer than the so-called "grace period" (just a constant). If the reclamation procedure fails during the grace period, the NFS client (usually a kernel-space beast) sends SIGUSR1 to any process that wasn't able to recover its locks. That's the root of your problem.

When the lock succeeds on the server side, rpc.lockd on the client system requests another daemon, rpc.statd, to monitor the NFS server that implements the lock. If the server fails and then recovers, rpc.statd will be informed. It then tries to reestablish all active locks. If the NFS server fails and recovers, and rpc.lockd is unable to reestablish a lock, it sends a signal (SIGUSR1) to the process that requested the lock.

http://menehune.opt.wfu.edu/Kokua/More_SGI/007-2478-010/sgi_html/ch07.html

You're probably wondering how to avoid this. Well, there are a few ways, but none of them is ideal:

  1. Increase the grace period. AFAIR, on Linux it can be changed via /proc/fs/nfsd/nfsv4leasetime.
  2. Install a SIGUSR1 handler in your code and do something smart there. For instance, the handler could set a flag denoting that lock recovery has failed; when the flag is set, your program can wait for the NFS server to become ready again (as long as it needs) and then try to re-acquire the locks itself. Not very fruitful... (see the sketch after this list)
  3. Do not use NFS locking ever again. If it's possible, switch to ZooKeeper as was suggested earlier.
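
For option 2, here is a minimal sketch of such a handler, assuming the JVM on your platform lets you register USR1 through the unsupported sun.misc.Signal API (HotSpot may reserve the signal for itself, in which case Signal.handle throws IllegalArgumentException); the class and method names are just illustrative:

    import sun.misc.Signal;
    import sun.misc.SignalHandler;

    import java.util.concurrent.atomic.AtomicBoolean;

    public class NfsLockWatcher {

        // Set when the NFS client signals that lock reclamation failed.
        private static final AtomicBoolean locksLost = new AtomicBoolean(false);

        public static void installHandler() {
            // May throw IllegalArgumentException if the VM already owns USR1.
            Signal.handle(new Signal("USR1"), new SignalHandler() {
                @Override
                public void handle(Signal sig) {
                    locksLost.set(true);
                }
            });
        }

        public static boolean locksLost() {
            return locksLost.get();
        }
    }

The main loop from the question could then check locksLost() after each sleep, close the stale RandomAccessFile, and keep calling tryLock() until it succeeds again.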
answered by Dan Kruchinin


Exit code 138 does NOT hint at SIGKILL - this is signal 10, which can be SIGBUS (on Solaris) or SIGUSR1 (on Linux). Unfortunately, you don't tell us which one you're using.

In theory, NFS should handle everything transparently - the machine crashes, reboots, and clears the locks. In practice, I've never seen this work well with NFSv3, and NFSv4 (which you're using) makes things even harder, as there is no separate lockd and statd.

I'd recommend you run truss (Solaris) or strace (Linux) on your Java process, then pull the network plug, to find out what's really going on. But to be honest, locking on NFS file systems is something people have recommended against for as long as I've been using Unix (more than 25 years by now), and I'd strongly recommend you write a small server program that handles the "who does what" thing. Let your clients connect to the server, let them send some "starting with X" and "stopping to do X" messages to the server, and have the server gracefully time out the connection if a client doesn't answer for more than, say, 5 minutes. I'm 99% sure this will take you less time than trying to fix NFS locking.
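
A very rough sketch of that idea, assuming a line-based TCP protocol in which clients send "starting ..."/"stopping ..." messages plus periodic heartbeats, and the server drops anyone who stays silent for 5 minutes (the class name, port, and message format are made up for illustration):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class WhoDoesWhatServer {

        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(9999);
            while (true) {
                final Socket client = server.accept();
                new Thread(new Runnable() {
                    public void run() {
                        handle(client);
                    }
                }).start();
            }
        }

        private static void handle(Socket client) {
            try {
                // Each read blocks for at most 5 minutes; a silent client times out.
                client.setSoTimeout(5 * 60 * 1000);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(client.getInputStream()));
                String line;
                while ((line = in.readLine()) != null) {
                    // "starting X" / "stopping X" update the view of who does what;
                    // any other line (e.g. a bare heartbeat) just proves liveness.
                    System.out.println(client.getRemoteSocketAddress() + ": " + line);
                }
            } catch (SocketTimeoutException e) {
                System.out.println(client.getRemoteSocketAddress()
                        + " timed out, treating its work as abandoned");
            } catch (Exception e) {
                System.out.println(client.getRemoteSocketAddress() + " disconnected: " + e);
            } finally {
                try { client.close(); } catch (Exception ignored) { }
            }
        }
    }

The active/passive decision then lives in one ordinary process instead of in NFS lock state, which is exactly what this answer argues for.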

answered by Guntram Blohm


This behavior is reproducible with system call flock(2) and shell utility flock(1).

Since you're able to reproduce it outside of Java, it sounds like an infrastructure issue. You didn't give too much information on your NFS server or client OS, but one thing that I've seen cause weird behavior with NFS is incorrect DNS configuration.

Check that the output from "uname -n" and "hostname" on the client match your DNS records. Check that the NFS server is resolving DNS correctly.
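
If it helps, here is a rough Java equivalent of those checks, looking at how the client resolves its own name and the server's (the NFS server name below is only a placeholder):

    import java.net.InetAddress;

    public class DnsSanityCheck {
        public static void main(String[] args) throws Exception {
            // Compare with the output of `uname -n` / `hostname` on the client.
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("hostname:  " + local.getHostName());
            System.out.println("canonical: " + local.getCanonicalHostName());
            System.out.println("address:   " + local.getHostAddress());

            // Forward- and reverse-resolve the NFS server; both directions should agree.
            String nfsServer = args.length > 0 ? args[0] : "nfs-server.example.com";
            for (InetAddress a : InetAddress.getAllByName(nfsServer)) {
                System.out.println(nfsServer + " -> " + a.getHostAddress()
                        + " -> " + a.getCanonicalHostName());
            }
        }
    }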

Like Guntram, I too advise against using NFS for this sort of thing. I would use either Hazelcast (no server, instances cluster dynamically) or ZooKeeper (needs a server to be set up).

With Hazelcast, you can do this to acquire an exclusive cluster-wide lock:

    import com.hazelcast.core.Hazelcast;
    import java.util.concurrent.locks.Lock;

    // myLockedObject can be any serializable key identifying the lock
    // cluster-wide; the value below is just an example.
    Object myLockedObject = "active-node-lock";

    Lock lock = Hazelcast.getLock(myLockedObject);
    lock.lock();
    try {
        // do something here
    } finally {
        lock.unlock();
    }

It also supports timeouts:

    // tryLock(long, TimeUnit) needs java.util.concurrent.TimeUnit
    // and throws InterruptedException.
    if (lock.tryLock(5000, TimeUnit.MILLISECONDS)) {
        try {
            // do some stuff here..
        } finally {
            lock.unlock();
        }
    }
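
For the ZooKeeper route, a distributed lock usually goes through a recipe library; here is a minimal sketch with Apache Curator's InterProcessMutex (Curator is my suggestion, not part of the answer, and the connection string and lock path are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;
    import java.util.concurrent.TimeUnit;

    CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1:2181,zk2:2181,zk3:2181",            // placeholder connection string
            new ExponentialBackoffRetry(1000, 3));
    client.start();

    InterProcessMutex mutex = new InterProcessMutex(client, "/locks/active-node");
    if (mutex.acquire(5, TimeUnit.SECONDS)) {        // returns false on timeout
        try {
            // this process is the active one
        } finally {
            mutex.release();
        }
    }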
answered by John R