Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is dispatch_sync on custom concurrent queue deadlocking

I'm seeing an intermittent deadlock in my app when using dispatch_sync on a custom concurrent dispatch_queue. I'm using something similar to the method described in Mike Ash's blog to support concurrent read access but threadsafe mutations on an NSMutableDictionary that acts as a cache of currently active network RPC requests. My project uses ARC.

I create the queue with:

dispatch_queue_t activeRequestsQueue = dispatch_queue_create("my.queue.name",
                                                DISPATCH_QUEUE_CONCURRENT);

and the mutable dictionary with

NSMutableDictionary *activeRequests = [[NSMutable dictionary alloc] init];

I read elements from the queue like this:

- (id)activeRequestForRpc: (RpcRequest *)rpc
{
    assert(![NSThread isMainThread]);
    NSString * key = [rpc getKey];
    __block id obj = nil;
    dispatch_sync(activeRequestsQueue, ^{
        obj = [activeRequests objectForKey: key];
    });
    return obj;
}

I add and remove rpcs from the cache

- (void)addActiveRequest: (RpcRequest *)rpc
{
    NSString * key = [rpc getKey];
    dispatch_barrier_async(activeRequestsQueue, ^{
        [activeRequests setObject: rpc forKey: key];
    });
}

- (void)removeActiveRequest: (RpcRequest *)rpc
{
    NSString * key = [rpc getKey];
    dispatch_barrier_async(activeRequestsQueue, ^{
        [activeRequests removeObjectForKey:key];
    });
}

I'm seeing the deadlock in the call to activeRequestForRpc when I make a lot of network requests at once which leads me to believe that one of the barrier blocks (add or remove) is not completing execution. I always call activeRequestForRpc from a background thread, and the app UI doesn't freeze so I don't think it has to do blocking the main thread, but I added the assert statement just in case. Any ideas on how this deadlock could be happening?

UPDATE: adding code that is calling these methods

I'm using AFNetworking to make the network requests and I have an NSOperationQueue that I'm scheduling the 'check cache and maybe fetch resource from network' logic. I'll call that op the CheckCacheAndFetchFromNetworkOp. Inside that op I make a call out to my custom subclass of AFHTTPClient to make an RPC request.

// this is called from inside an NSOperation executing on an NSOperationQueue.
- (void) enqueueOperation: (MY_AFHTTPRequestOperation *) op {
    NSError *error = nil;
    if ([self activeRequestForRpc:op.netRequest.rpcRequest]) {
        error = [NSError errorWithDomain:kHttpRpcErrorDomain code:HttpRpcErrorDuplicate userInfo:nil];
    }
    // set the error on the op and cancels it so dependent ops can continue.
    [op setHttpRpcError:error];

    // Maybe enqueue the op
    if (!error) {
        [self addActiveRequest:op.netRequest.rpcRequest];
        [self enqueueHTTPRequestOperation:op];
    }
}

The MY_AFHTTRequestOperation is built by the AFHTTPClient instance and inside both the success and failure completion blocks I call [self removeActiveRequest:netRequest.rpcRequest]; as the first action. These blocks are executed on the main thread by AFNetworking as the default behavior.

I've seen the deadlock happen where the last barrier block that must be holding the lock on the queue is both the add block and the remove block.

Is it possible that as the system spawns more threads to support the CheckCacheAndFetchFromNetworkOp Ops in my NSOperationQueue, the activeRequestsQueue would be too low priority to get scheduled? That could cause deadlock if all threads were taken by CheckCacheAndFetchFromNetworkOps blocking to try and read from the activeRequests Dictionary, and the activeRequestsQueue was blocking on an add/remove barrier block that couldn't execute.

UPDATE

Fixed the issue by setting the NSOperationQueue to have maxConcurrentOperation count of 1 (or really anything reasonable other than the default NSOperationQueueDefaultMaxConcurrentOperationCount).

Basically the lesson I took away is that you shouldn't have an NSOperationQueue with the default max operation count wait on any other dispatch_queue_t or NSOperationQueue since it could potentially hog all threads from those other queues.

This is what was happening.

queue - NSOperationQueue set to default NSDefaultMaxOperationCount which lets system determine how many concurrent ops to run.

op - runs on queue1 and schedules a network request on the AFNetworking queue after reading to make sure the RPC isn't in the activeRequest set.

Here is the flow:

The system determines that it can support 10 concurrent threads (In reality it was more like 80).

10 ops get scheduled at once. The system lets 10 ops run concurrently on it's 10 threads. All 10 ops call hasActiveRequestForRPC which schedules a sync block on the activeRequestQueue and blocks the 10 threads. The activeRequestQueue wants to run it's read block, but doesn't have any available threads. At this point we already have a deadlock.

More commonly I would see something like 9 ops (1-9) get scheduled, one of them, op1, quickly runs a hasActiveRequestForRPC on the 10th thread and schedules an addActiveRequest barrer block. Then another op would get scheduled on the 10th thread and the op2-10 would schedule and wait on an hasActiveRequestForRPC. Then the op1's scheduled addRpc block wouldn't run since the op10 took up the last available thread, and all the other hasActiveRequestForRpc blocks would wait for the barrier block to execute. op1 would end up blocking later when it tried to schedule a cache operation on a different operation queue that also couldn't get access to any threads.

I was assuming that the blocking hasActiveRequestForRPC were waiting on a barrer block to execute, but the key was the activeRequestQueue waiting on any thread availability.

like image 981
Ted Tomlinson Avatar asked Dec 19 '12 23:12

Ted Tomlinson


1 Answers

EDIT: Turns out the problem was that the NSOperationQueue which is calling enqueueOperation: is using all available threads, so since they are all waiting (via dispatch_sync) for something to happen on the activeRequestsQueue. Reducing the maxConcurrentOperations on this queue solved the problem (see comments), though this is not really a great solution because it makes assumptions about the number of cores, etc. A better solution would be to use dispatch_async rather than dispatch_sync, though this will make the code more complex.

My earlier suggestions:

  • You're calling dispatch_sync(activeRequestsQueue, ...) when you're already on the activeRequestsQueue (and your assert isn't firing for some reason, like you're running in release.)

  • [activeRequests removeObjectForKey:key]; is causing a request to be deallocated, and the dealloc is waiting for something that calls activeRequestForRpc:, which would cause a deadlock.

like image 72
Jesse Rusak Avatar answered Oct 15 '22 20:10

Jesse Rusak