
Implementing deadlock detection with Apache ZooKeeper

I work for a small software company and I have been tasked with researching a distributed lock manager for us to use. It must interface with both Java and C++.

I have been working with ZooKeeper for a couple of weeks and have implemented shared locks (read and write locks) according to the documentation. I now need to implement deadlock detection. If each client could maintain a graph of the locks, it would be fast and easy. However, you cannot reliably see every change that happens to a node in ZooKeeper, so maintaining an accurate graph would be impossible. This means that every time that I check for a deadlock, I would need to download many locks, which seems impractical.

Another solution would be to implement deadlock detection within the ZooKeeper server, which I'm working on now. Each client would create a node within '/waiting' that is named after its session ID, and its data would be the lock it's waiting for. Since each lock has an ephemeral owner, I would have enough information to detect a deadlock.

The problem I have is that the ZooKeeper server doesn't have the synchronization guarantees that the ZooKeeper client has. Plus, the ZooKeeper server isn't nicely documented like the client is, because you're generally not supposed to touch it.

So my question is this: how is one supposed to implement deadlock detection with Apache ZooKeeper? I see many people here recommending ZooKeeper as a distributed lock manager, but if it can't support deadlock detection, then no one should use it for this purpose.


EDIT:

I have a working solution. I can't guarantee its correctness, but it has passed all my tests.

I am sharing my checkForDeadlock method, which is the heart of the deadlock detection algorithm. Here's the additional information that you need to know:

  • Only one client should be running deadlock detection at a time.
  • First a client tries to acquire a lock on a resource. If the resource is already locked and the client wants to wait until it becomes available, then the client next checks for a deadlock. If waiting for the resource would not cause a deadlock, then it next creates a znode in a special directory which identifies that this client is waiting for that resource. That line looks like this: waitNode = zooKeeper.create(waitingPath + "/" + sessionID, resource.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  • No other client should begin checking for deadlock until after this client has created the wait node.
  • If two clients attempt to acquire locks at almost the same time, and granting both would cause a deadlock, there is a small chance that the first client is rejected and the second client gets the lock, rather than the other way around. This shouldn't be a problem.
  • checkForDeadlock throws a DeadlockException if it discovers a deadlock. Otherwise, it returns normally.
  • Locks are granted strictly in order. If a resource has a granted read lock and a waiting write lock, and another client wants to acquire a read lock, it has to wait until after the write lock is granted and then released.
  • bySequenceNumber is a comparator that sorts znodes by the sequence that ZooKeeper appends to the end of sequential znodes.
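
For readers who want to see the last two points concretely, here is a minimal sketch of what a comparator like bySequenceNumber and the "granted locks" rule could look like. It is not the code from the question: the class name SequenceOrder, the helper grantedLocks, and the "read-"/"write-" prefixes are assumptions made for illustration. It relies only on the fact that ZooKeeper appends a zero-padded 10-digit counter to sequential znode names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SequenceOrder {
    // Parses the zero-padded 10-digit counter that ZooKeeper appends to
    // sequential znode names. Assumes the name really ends with that suffix.
    public static long sequenceOf(String znodeName) {
        return Long.parseLong(znodeName.substring(znodeName.length() - 10));
    }

    // Hypothetical stand-in for the bySequenceNumber comparator.
    public static final Comparator<String> BY_SEQUENCE_NUMBER =
            Comparator.comparingLong(SequenceOrder::sequenceOf);

    // With strictly in-order granting, the granted locks are the head of the
    // sorted queue: either a single write lock, or the leading run of read locks.
    public static List<String> grantedLocks(List<String> sortedQueue, String readPrefix) {
        List<String> granted = new ArrayList<>();
        for (String lock : sortedQueue) {
            if (!lock.startsWith(readPrefix)) {
                if (granted.isEmpty()) {
                    granted.add(lock); // the head is a write lock, so it alone is granted
                }
                break; // everything behind the first write lock is waiting
            }
            granted.add(lock);
        }
        return granted;
    }

    public static void main(String[] args) {
        // Assumed naming scheme: "read-" / "write-" plus the sequence suffix.
        List<String> lockQueue = new ArrayList<>(Arrays.asList(
                "write-0000000012", "read-0000000003", "read-0000000007"));
        lockQueue.sort(BY_SEQUENCE_NUMBER);
        System.out.println(lockQueue);
        // prints [read-0000000003, read-0000000007, write-0000000012]
        System.out.println(grantedLocks(lockQueue, "read-"));
        // prints [read-0000000003, read-0000000007]
    }
}
```

The sketch ignores counter overflow and names shorter than ten characters; real code would need to guard against both.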

Code:

private void checkForDeadlock(String pathToResource) throws DeadlockException {
    // Algorithm:
    //   For each client who holds a lock on this resource:
    //     If this client is me, announce deadlock.
    //     Otherwise, if this client is waiting for a reserved resource, recursively check for deadlock on that resource.
    try {
        List<String> lockQueue = zooKeeper.getChildren(pathToResource, false); // Last I checked, children is implemented as an ArrayList.
        // lockQueue is the list of locks on this resource.
        // FIXME There is a slight chance that lockQueue could be empty.
        Collections.sort(lockQueue, bySequenceNumber);
        ListIterator<String> lockQueueIterator = lockQueue.listIterator();
        String grantedLock = lockQueueIterator.next(); // grantedLock is one lock on this resource.
        do {
            // lockQueue must contain a write lock, because there is a lock waiting.
            // exists() returns null if the lock znode was deleted while deadlock detection was running.
            // Since the lock was deleted, its owner was obviously not part of a deadlock.
            // Therefore I can ignore this lock and move on to the next one.
            // (Note that a lock can be deleted if and only if its owner is not part of a deadlock.)
            Stat lockStat = zooKeeper.exists(pathToResource + "/" + grantedLock, false);
            if (lockStat == null) {
                continue;
            }
            // lockOwner is one client who holds a lock on this resource.
            String lockOwner = Long.toString(lockStat.getEphemeralOwner());
            if (lockOwner.equals(sessionID)) { // If this client is me.
                throw new DeadlockException("Waiting for this resource would result in a deadlock.");
            }
            try {
                // XXX: Is it possible that reservedResource could be null?
                String reservedResource = new String(zooKeeper.getData(waitingPath + "/" + lockOwner, false, new Stat()));
                // reservedResource is the resource that this client is waiting for. If this client is not waiting for a resource, see exception.
                // I only recursively check the next reservedResource if I haven't checked it before.
                // I need to do this because, while I'm running my deadlock detection, another client may attempt to acquire
                // a lock that would cause a deadlock. Without this check, I would loop in that deadlock cycle indefinitely.
                if (checkedResources.add(reservedResource)) {
                    checkForDeadlock(reservedResource); // Depth-first-search
                }
            }
            catch (KeeperException.NoNodeException e) {
                // lockOwner is not waiting for a resource.
            }
            catch (KeeperException e) {
                e.printStackTrace(syncOut);
            }
            // This loop needs to run for each lock that is currently being held on the resource. There are two possibilities:
            // A. There is exactly one write lock on this resource. (Any other locks would be waiting locks.)
            //      In this case, the do-while loop ensures that the write lock has been checked.
            //      The condition that requires that the current lock is a read lock ensures that no locks after the write lock will be checked.
            // B. There are one or more read locks on this resource.
            //      In this case, I just check that the next lock is a read lock before moving on.
        } while (grantedLock.startsWith(readPrefix) && (grantedLock = lockQueueIterator.next()).startsWith(readPrefix));
    }
    catch (NoSuchElementException e) {
        // The condition for the do-while loop assumes that there is a lock waiting on the resource.
        // This assumption was made because a client just reported that it was waiting on the resource.
        // However, there is a small chance that the client has since gotten the lock, or even released it before
        // we check the locks on the resource.
        // FIXME (This may be a problem.)
        // In such a case, the lockQueueIterator.next() call could throw a NoSuchElementException.
        // We can safely assume that we are finished searching this branch, and therefore return.
    }
    catch (KeeperException e) {
        e.printStackTrace(syncOut);
    }
    catch (InterruptedException e) {
        e.printStackTrace(syncOut);
    }

}
dln385 asked Apr 24 '12 19:04



1 Answer

You need two things to do deadlock detection: a list of lock owners and a list of lock waiters. The standard ZooKeeper lock recipe gives you both, as long as you write some sort of node ID to the znodes you create.

You don't need to see every change in ZooKeeper to detect deadlocks. A deadlock isn't something that appears and disappears quickly. By definition, a deadlock sticks around until you do something about it. So if you write your code so that clients watch every lock node they are interested in, each client will eventually see the owners and waiters for each lock, and will therefore see the deadlock.

You do have to be careful, however. The client may not see updates in order, because updates can occur while the client is re-registering a watch. So if a client does detect a deadlock, it should confirm that the deadlock is real by re-reading the owners and waiters of the locks involved.
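
As a rough illustration of that idea, here is a minimal sketch of a wait-for check over an in-memory view of owners and waiters, followed by a second check on freshly read data. Everything here is an assumption made for the example: the class WaitForCheck, its method names, and the two Supplier arguments (which stand in for whatever code reads the lock znodes) are hypothetical, and no ZooKeeper calls are shown.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

public class WaitForCheck {
    // ownersByResource: resource -> clients currently holding a lock on it.
    // waitingFor: client -> the one resource it is waiting for, if any.
    // Returns true if 'me' waiting on 'resource' would close a cycle.
    public static boolean wouldDeadlock(String me, String resource,
                                        Map<String, Set<String>> ownersByResource,
                                        Map<String, String> waitingFor) {
        Deque<String> toVisit = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        toVisit.push(resource);
        visited.add(resource);
        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            Set<String> owners =
                    ownersByResource.getOrDefault(current, Collections.<String>emptySet());
            for (String owner : owners) {
                if (owner.equals(me)) {
                    return true; // the chain of waiters leads back to me
                }
                String next = waitingFor.get(owner);
                if (next != null && visited.add(next)) {
                    toVisit.push(next); // follow the owner to the resource it waits for
                }
            }
        }
        return false;
    }

    // Reports a deadlock only if it is still present in a second, fresh read,
    // because the first view may have been assembled from out-of-order updates.
    public static boolean confirmedDeadlock(String me, String resource,
                                            Supplier<Map<String, Set<String>>> readOwners,
                                            Supplier<Map<String, String>> readWaiting) {
        if (!wouldDeadlock(me, resource, readOwners.get(), readWaiting.get())) {
            return false;
        }
        return wouldDeadlock(me, resource, readOwners.get(), readWaiting.get());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> owners = new HashMap<>();
        owners.put("/locks/r1", Collections.singleton("A"));
        owners.put("/locks/r2", Collections.singleton("B"));
        Map<String, String> waiting = new HashMap<>();
        waiting.put("A", "/locks/r2"); // A holds r1 and waits for r2, which B holds

        // B now wants r1: B -> r1 -> A -> r2 -> B is a cycle.
        System.out.println(confirmedDeadlock("B", "/locks/r1", () -> owners, () -> waiting)); // prints true

        waiting.clear(); // A is no longer waiting, so the cycle is gone
        System.out.println(confirmedDeadlock("B", "/locks/r1", () -> owners, () -> waiting)); // prints false
    }
}
```

In a real client the suppliers would re-read only the locks that took part in the suspected cycle; the sketch re-reads everything to keep the example short.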

sbridges answered Nov 15 '22 01:11