Note: I have heavily edited this question for clarity after making a mess of it brainstorming in public. However, the actual algorithms described, and the question of whether they're sufficient to avoid races, are unchanged.
I'm trying to implement semaphores that avoid the race condition described in glibc bug number 12674:
http://sourceware.org/bugzilla/show_bug.cgi?id=12674
Basically, the efficient way to write a futex-based semaphore, if you don't care about this race condition for destruction, is to keep the semaphore value and a count of waiters, and do:

Post:

1. Atomically increment the semaphore value.
2. Read the waiter count; if it is nonzero, perform a futex wake on the semaphore value.

Wait:

1. Read the semaphore value.
2. If it is positive, attempt to compare-and-swap it to one less; if that succeeds, return. Otherwise restart at step 1.
3. Atomically increment the waiter count.
4. Futex wait on the semaphore value with an expected value of 0.
5. Atomically decrement the waiter count and restart at step 1.

Step 2 of the post operation is what's unsafe: once the value has been incremented, a waiter can return, conclude it is the last user of the semaphore, and legitimately destroy and free it, so the poster's subsequent read of the waiter count may touch freed memory.
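In rough C, that looks something like the sketch below (illustration only: error checking, memory-ordering fine print, and EINTR/timeout handling are omitted, and the helper names are made up):

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    struct sem { atomic_int value; atomic_int waiters; };

    /* Thin wrapper around the futex syscall. */
    static long futex(void *uaddr, int op, int val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* Post: bump the value, then wake someone if a waiter was recorded.
     * Reading s->waiters after the increment is the unsafe step 2. */
    void sem_post_racy(struct sem *s)
    {
        atomic_fetch_add(&s->value, 1);
        if (atomic_load(&s->waiters) != 0)
            futex(&s->value, FUTEX_WAKE, 1);
    }

    /* Wait: fast path is a userspace CAS; slow path sleeps in the kernel. */
    void sem_wait_racy(struct sem *s)
    {
        for (;;) {
            int v = atomic_load(&s->value);
            if (v > 0) {
                if (atomic_compare_exchange_weak(&s->value, &v, v - 1))
                    return;
                continue;
            }
            atomic_fetch_add(&s->waiters, 1);
            futex(&s->value, FUTEX_WAIT, 0);   /* sleep while value == 0 */
            atomic_fetch_sub(&s->waiters, 1);
        }
    }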
There are two obvious implementations which avoid this issue, but both have major problems:
The first solution is not to store a waiter count or flag at all, and always perform a futex wake on post. This is obviously safe, but defeats the whole purpose of futexes (keeping the uncontended case in userspace).
The second solution is not to store a waiter count, but instead let the semaphore value decrement to -1 on wait contention. The post operation then transitions from -1 to 1 and wakes all the waiters. One of them succeeds in decrementing the semaphore value to 0, and if any remain, they set the value to -1, so the next post will perform another wake. This solution is also obviously safe, but it results in a stampede of threads contending for the semaphore when it's posted.
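A sketch of that variant, reusing the struct and futex() helper from above (again just an illustration):

    #include <limits.h>   /* INT_MAX */

    /* No waiter count is consulted; -1 in the value itself means "contended". */
    void sem_wait_stampede(struct sem *s)
    {
        for (;;) {
            int v = atomic_load(&s->value);
            if (v > 0) {
                if (atomic_compare_exchange_weak(&s->value, &v, v - 1))
                    return;
                continue;
            }
            /* Mark the semaphore contended (0 -> -1) before sleeping. */
            if (v == 0 && !atomic_compare_exchange_weak(&s->value, &v, -1))
                continue;
            futex(&s->value, FUTEX_WAIT, -1);
        }
    }

    void sem_post_stampede(struct sem *s)
    {
        for (;;) {
            int v = atomic_load(&s->value);
            int newv = (v < 0) ? 1 : v + 1;             /* -1 -> 1, otherwise +1 */
            if (!atomic_compare_exchange_weak(&s->value, &v, newv))
                continue;
            if (v < 0)                                  /* there were waiters... */
                futex(&s->value, FUTEX_WAKE, INT_MAX);  /* ...wake them all */
            return;
        }
    }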
In summary, the first solution only works well for semaphores that are always contended, and the second only works well for semaphores that usually have no more than one waiter. Neither is acceptable for general use.
Now, on to an attempt at a real solution...
At this point, it should be noted that there are a few other requirements that complicate a real-world implementation. The post operation is required to be async-signal-safe (so it basically cannot use locks), and the wait operation is required to allow interruption by signals, timeout, or thread cancellation. In practice this means the post operation must be able to safely "back out" any changes it makes to the semaphore state. I have glossed over such issues because my approach seems to have no problems with them, but they do make some otherwise-obvious changes/solutions impossible, so anyone suggesting new approaches as an answer should be aware of them...
I have a proposed solution, but I'm unsure whether it's subject to new (and possibly worse) race conditions. I will describe it here, and I hope some concurrency gods (or demigods at least) might have the kindness to review it for correctness.
My approach is something of a hybrid of the second "bad" solution I described above, and the original approach (with the race described in the glibc bug report). It uses both a waiter count and a waiter flag (-1 stored in the semaphore value).
The key change to the wait operation is that it stores -1 instead of 0 in the semaphore value whenever there are waiters (preexisting waiter count or itself as a new waiter).
Wait:

1. Read the semaphore value.
2. If the value is positive, compute the replacement value: one less than the current value, except that if the result would be 0 and the waiter count is nonzero, use -1 instead. Compare-and-swap the semaphore value to the replacement; if that succeeds, return, otherwise restart at step 1.
3. Otherwise (the value is 0 or -1), atomically increment the waiter count.
4. If the value read was 0, compare-and-swap it from 0 to -1 (if this fails because the value changed, just carry on).
5. Futex wait on the semaphore value with an expected value of -1.
6. Atomically decrement the waiter count and restart at step 1.
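Sketched in C, using the same struct and futex() helper as before (an illustration of the steps above, not hardened code; signal and cancellation handling is ignored):

    void sem_wait_proposed(struct sem *s)
    {
        for (;;) {
            int v = atomic_load(&s->value);
            if (v > 0) {
                /* Taking the last token while others still wait: store -1, not 0. */
                int newv = v - 1;
                if (newv == 0 && atomic_load(&s->waiters) != 0)
                    newv = -1;
                if (atomic_compare_exchange_weak(&s->value, &v, newv))
                    return;
                continue;
            }
            atomic_fetch_add(&s->waiters, 1);
            if (v == 0)                        /* announce ourselves in the value too */
                atomic_compare_exchange_strong(&s->value, &v, -1);
            futex(&s->value, FUTEX_WAIT, -1);
            atomic_fetch_sub(&s->waiters, 1);
        }
    }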
The key change to the post operation is that it reads the waiter count before incrementing the semaphore value, rather than afterwards:

Post:

1. Read the waiter count.
2. Read the semaphore value.
3. Compute the new semaphore value: 1 if the value read is -1, otherwise one more than it.
4. Compare-and-swap the semaphore value from the value read in step 2 to the new value; if the compare-and-swap fails, restart at step 1.
5. If the waiter count read in step 1 was nonzero, or the value read in step 2 was -1, perform a futex wake.
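And the post in the same style (again only a sketch of the steps above):

    void sem_post_proposed(struct sem *s)
    {
        for (;;) {
            int w = atomic_load(&s->waiters);      /* step 1 */
            int v = atomic_load(&s->value);        /* step 2 */
            int newv = (v == -1) ? 1 : v + 1;      /* step 3 */
            if (!atomic_compare_exchange_strong(&s->value, &v, newv))
                continue;                          /* step 4 failed, retry */
            if (w != 0 || v == -1)                 /* step 5 */
                futex(&s->value, FUTEX_WAKE, 1);
            return;
        }
    }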
The compare-and-swap in step 4 provides some safety that the waiter count is still correct, but there's still an ABA race - between steps 1 and 4, it's possible that other threads perform wait and post operations that leave the semaphore value the same as its initial value.
In looking for cases where this algorithm might fail to wake waiters, we need only consider cases where the initial waiter count read is 0 and the semaphore value read is not -1. Further, if the semaphore value is positive and there are no preexisting waiters, the current post is not responsible for any wakeups, so this case is not interesting either. We're left examining cases where the post operation begins with a zero semaphore value and a zero waiter count. In this situation, in order not to have a race condition, any event that happens between steps 2 and 4 that results in new waiters must change the semaphore value, so that the compare-and-swap at step 4 fails. Clearly any single intervening post or wait will change the semaphore value (to 1 or -1, respectively), so the case of concern, more specifically, is sequences of operations which leave the semaphore value at 0 but with waiters present.
I believe this cannot happen due to the procedure followed in the wait operation, but I haven't convinced myself 100%.
Finally, here are some examples of races that happen if you weaken my algorithm, in order to establish the motivations for what it's doing if that's not clear.
Failure 1: Using a pure waiter count, with no -1 flag in the semaphore value. A trivial race (starting from value 0, waiter count 0) looks like:

1. A poster reads the waiter count (0) and the semaphore value (0).
2. A new waiter arrives: it increments the waiter count and goes to sleep in futex wait, leaving the value at 0.
3. The poster's compare-and-swap of the value from 0 to 1 succeeds, and since the waiter count it read was 0, it skips the wake.
4. The waiter sleeps forever even though the semaphore value is now 1. With the -1 flag, the waiter would have changed the value in step 2 and the compare-and-swap would have failed.
Failure 2: Using a waiter count and having new waiters set the semaphore value to -1, but simply decrementing the semaphore value when wait succeeds (without setting it to -1 if other threads are still waiting). Starting from value 0, waiter count 0:

1. Poster A reads the waiter count (0) and the semaphore value (0), then stalls before its compare-and-swap.
2. Waiters W1 and W2 arrive: the waiter count becomes 2, the value becomes -1, and both sleep.
3. Poster B runs to completion: it sees the waiters, changes the value from -1 to 1, and wakes W1.
4. W1 decrements the value from 1 to 0 (rather than to -1, even though W2 is still waiting), decrements the waiter count, and returns.
5. Poster A resumes: the value has cycled back to 0, so its compare-and-swap from 0 to 1 succeeds, and since the waiter count it read was 0, it skips the wake.
6. W2 sleeps forever even though the semaphore value is now 1. With the unweakened algorithm, W1 would have stored -1 in step 4, and Poster A's compare-and-swap would have failed.
First of all, let me bring up two alternative approaches you may wish to consider.
Approach #1 (X86-specific, fast): CMPXCHG8B/CMPXCHG16B.
x86 platforms have a double-pointer-width atomic compare-and-exchange operation: on 32-bit this is the 8-byte CMPXCHG8B, and on 64-bit there's CMPXCHG16B, which atomically compares and exchanges a full 16 bytes of data. Using this, you can update both the waiter count and the semaphore count in a single atomic operation. futex can only wait on a single 32-bit field, but that shouldn't be a severe problem in this case.
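For instance, on x86-64 with GCC's __atomic builtins (compile with -mcx16), something along these lines; the layout and names are just one possible choice:

    #include <stdint.h>

    /* Both fields live in one 16-byte, 16-byte-aligned unit; the futex
     * still waits only on the 32-bit value field. */
    struct sem16 {
        _Alignas(16) uint32_t value;
        uint32_t pad;
        uint64_t waiters;
    };

    /* futex() is the same thin syscall wrapper shown in the question. */
    void sem16_post(struct sem16 *s)
    {
        struct sem16 cur, next;
        __atomic_load(s, &cur, __ATOMIC_SEQ_CST);
        do {
            next = cur;
            next.value = cur.value + 1;
        } while (!__atomic_compare_exchange(s, &cur, &next, 0,
                                            __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
        /* The waiter count acted on here is the one captured atomically
         * together with the value, not a separate later read. */
        if (cur.waiters != 0)
            futex(&s->value, FUTEX_WAKE, 1);
    }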
Approach #2 (portable, limited): Packed counts.
If a limit of 2^16 for waiter and semaphore counts is acceptable, simply pack both counts in a single 32-bit field.
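For instance (sketch only; overflow of the 16-bit halves is not handled, and futex() is again the thin syscall wrapper from the question):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Low 16 bits: semaphore count.  High 16 bits: waiter count.
     * The futex waits on this single 32-bit word. */
    #define SEM_COUNT(x)  ((x) & 0xffffu)
    #define WAITERS(x)    ((x) >> 16)

    void packed_post(_Atomic uint32_t *word)
    {
        uint32_t old = atomic_load(word);
        uint32_t next;
        do {
            next = old + 1;          /* bump the low half: the semaphore count */
        } while (!atomic_compare_exchange_weak(word, &old, next));
        /* The waiter check comes from the very word we just updated atomically. */
        if (WAITERS(old) != 0)
            futex(word, FUTEX_WAKE, 1);
    }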
Approach #3 (portable, has some overhead): Use a semaphore-in-a-semaphore to protect the post race.
Reserve 8 bits of the semaphore count for a lock over post operations. The post operation will increment this counter (blocking if it would overflow) at the same time as it increments the true semaphore count. It will then do its work with the waiters field, and then decrement the lock counter atomically. After decrementing, if the previous value was 255, wake up all waiters (although this causes a thundering herd, it should be extremely rare).
Upon deletion, acquire the lock 255 times (you can increment by more than one in a single step), blocking as necessary. Once the lock has been acquired all 255 times, you know that all posts have completed, and it is safe to delete the semaphore.
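In rough C, the layout and post path might look like this (sketch only; the "block on overflow" part is shown as a spin for brevity, and futex() is the same wrapper as before):

    #include <limits.h>
    #include <stdatomic.h>
    #include <stdint.h>

    /* Bits 0..23: semaphore count.  Bits 24..31: count of in-flight posts. */
    #define INFLIGHT(x)  ((x) >> 24)

    void locked_post(_Atomic uint32_t *word, _Atomic uint32_t *waiters)
    {
        uint32_t old;
        for (;;) {
            old = atomic_load(word);
            if (INFLIGHT(old) == 255)
                continue;            /* lock byte full: a real version would sleep */
            /* First atomic op: bump the real count and take an in-flight slot. */
            if (atomic_compare_exchange_weak(word, &old, old + (1u << 24) + 1u))
                break;
        }

        if (atomic_load(waiters) != 0)       /* safe: deletion drains posts first */
            futex(word, FUTEX_WAKE, 1);

        /* Second atomic op: release the slot; if it was saturated, wake
         * everything blocked on the lock (the rare thundering herd). */
        uint32_t prev = atomic_fetch_sub(word, 1u << 24);
        if (INFLIGHT(prev) == 255)
            futex(word, FUTEX_WAKE, INT_MAX);
    }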
Downside: Posts now require two atomic compare-exchanges, and the maximum semaphore count is 2^24-1. Additionally, recursively entering an asynchronous signal handler 255 times will deadlock.
These approaches are simpler, easier to prove correct, and likely faster. However, their limitations may mean they are unacceptable for your case (though the CMPXCHG8B/CMPXCHG16B approach should work quite well on x86). And here's one more:
Approach #4 (mostly architecture-independent, complex, fast): Modify the kernel.
One option here would be to modify the kernel to allow for a low-overhead, safe method for reading the waiter field without resulting in a segfault in the event of the memory being freed. For example, you could add a syscall that registers a thread-specific data structure; in that thread-specific data page you could have a 'fault handler jump address'. In the event that the program segfaults, if the jump address is nonzero, the kernel simply jumps there instead of raising SIGSEGV. Alternately, you could have a similar mechanism to simply suppress the offending instruction.
Now all you have to do is:

1. At the start of post, set the current thread's fault-handler jump address to a local "bail out" label.
2. Run the ordinary fast algorithm: increment the semaphore value, read the waiter count, and futex wake if it was nonzero.
3. Clear the jump address and return.
4. If the read of the waiter count faults because the semaphore has already been destroyed and its memory unmapped, control lands on the bail-out label instead of the process taking a SIGSEGV; just clear the jump address and return, since a semaphore that has legitimately been destroyed can no longer have waiters needing the wake.
And there you go - you get the fast algorithm, but you also get protection against the deletion race. But you have to hack the kernel's segv handlers to do it. It might be worth looking at SEH on Windows; a similar mechanism would work very well here.
In any case, I don't see anything wrong with your approach offhand, but I may be missing something. It might be good to raise it on appropriate mailing lists, and to consult with the futex maintainers; they would probably be interested in implementing support in the kernel to make this easier for you.