Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Safety nets in complex multi-threaded code?

As a developer who has just finished writing thousands of lines of complex multi-threaded 'C' code in a project, and which is going to be enhanced, modified etc. by several other developers unfamiliar with this code in the future, I wanted to find out what kind of safety nets do you guys try to put in such code? As an example I could do these:

  1. Define accessor macros for lock protected structure members, which assert that the corresponding lock is held. This makes it clear that these members are lock-protected to anyone unfamiliar with this code.
  2. Functions which are supposed to be called with some spinlock held, assert that the spinlock is being held.

What kind of safety nets have you put into multi-threaded code that you have written?
What kind of problems have you faced when other developers modified such code?
What kind of debugging aids have you put into such code?

Thanks for your comments.

like image 517
Sudhanshu Avatar asked Feb 28 '23 20:02

Sudhanshu


1 Answers

There are a number of things we do in our product (a hypervisor designed to help you find concurrency bugs in applications) that are more generally useful. Note that we do these in our code itself (because its a highly concurrent piece of software) and that some of these are useful whether or not you are writing concurrent code.

  • Like you, we have the ability to assert(lock_held(...)) and use it.

  • We also (because we have our own scheduler) can assert(single_threaded()) for those (rare) situations where we count on no other thread being active in the system.

  • Memory corruption from one thread to another is pretty common (and hard to debug) so we do two things to address this: sprinkled throughout our thread stack are some magic cookies. We periodically (in our get_thread_id()) function invoke a "validate_thread_stack()" function that checks these cookies to make sure the stack is not corrupted.

  • Our malloc sticks magic cookies before and after a malloc block of memory and checks these on free. If anyone overruns their data these can be used to find the corruption early.

  • On free() we blast a well known pattern (in our case 0xdddd...) over the memory. This nicely corrupts anyone else who had a dangling pointer left over to that memory region.

  • We have a guard page (a memory page not mapped into the address space) near the bottom of the thread stack. If the thread overruns its stack, we catch it via page fault and drop into our debugger.

  • Our locks are witnessed. Checkout the FreeBSD lock witness code. Its like that but homebrew. Basically the witness code is a lightweight way of detecting potential deadlocks by looking at cycles in the lock acquisition graph.

  • Our locks are also wrapped with accessors that record the file/line number of acquisition and release. For double unlocks or double locks, you get pretty debug information on your screwup.

  • Our locks are also profiled. Once you get your code working you want it working well. We track the usual things like how many acquisitions, how long it took to acquire it.

  • In our system, we have an expectation that locks are not contended (we carefully designed the code this way). So if you wait for a spin lock longer than a second or two in our system you get dropped into the debugger because its most likely not a good thing.

  • Our variables that are meant to be updated atomically are wrapped inside of C struct's. The reason for this is to prevent sloppy code where you mix good use: atomic_increment(&var); and bad use var++. We make it very hard to write the latter code.

  • "volatile" is forbidden in our code base because its ambiguously implemented by compilers. Its a bad way to try and cobble together synchronization.

  • And of course code reviews. If you can't explain your concurrency assumptions and locking discipline to a colleague, then there's definitely issues with the code :-)

like image 80
Mark oskin Avatar answered Mar 05 '23 17:03

Mark oskin