
Safely "lend" memory block to another thread in C, assuming no "concurrent access"

The problem

I want to allocate memory in one thread, and safely "lend" the pointer to another thread so it can read that memory.

I'm using a high level language that translates to C. The high level language has threads (of unspecified threading API, since it's cross-platform -- see below) and supports standard C multi-threading primitives, like atomic-compare-exchange, but it's not really documented (no usage examples). The constraints of this high-level language are:

  • Each thread executes an event-processing infinite loop.
  • Each thread has its own local heap, managed by some custom allocator.
  • Each thread has one "input" message queue, which can contain messages from any number of other threads.
  • The message-passing queues:
    1. Carry fixed-type messages
    2. Copy the message contents

Now this is impractical for large messages (I don't want the copy) or variable-sized messages (I think the array size is part of the type). I want to send such messages anyway, and here's the outline of how I want to achieve it (a rough struct sketch follows the list):

  • A message (either a request or a reply) can either store the "payload" inline (copied, with a fixed limit on total payload size), or a pointer to data in the sender's heap.
  • The message contents (data in the sender's heap) are owned by the sending thread (which allocates and frees them).
  • The receiving thread sends an ack to the sending thread when it is done with the message contents.
  • The sending thread must not modify the message contents after sending them, until it receives the ack.
  • There should never be concurrent read access to memory that is being written to, before the writing is done. This should be guaranteed by the message queues' work-flow.

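For illustration, a message under this scheme might look roughly like the following. This is only a sketch; INLINE_LIMIT, MsgKind and all field names are my own inventions, not part of the actual language runtime.

#include <stddef.h>

#define INLINE_LIMIT 64   // hypothetical cap on inline payload size

enum MsgKind { MSG_REQUEST, MSG_REPLY, MSG_ACK };

typedef struct Message {
    enum MsgKind kind;
    int sender_id;             // so the receiver knows where to send the ack
    size_t size;               // payload size in bytes
    int is_inline;             // 1: payload copied into inline_buf;
                               // 0: ptr points into the sender's heap
    union {
        unsigned char inline_buf[INLINE_LIMIT];
        void *ptr;             // owned (allocated and freed) by the sender
    } payload;
} Message;
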
I need to know how to ensure that this works without data races. My understanding is that I need to use memory fences, but I'm not entirely sure which kind (memory_order_release, ...) and where in the loop (or whether I need any at all).
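
For concreteness, by "memory fence" I mean the C11 <stdatomic.h> primitives. The kind of pattern I have in mind looks like this (a sketch only; whether such fences are needed at all when the queues are already mutex-protected is exactly what I'm asking):

#include <stdatomic.h>

static _Atomic(int) ready = 0;
static int payload = 0;

// producer thread
void produce(void) {
    payload = 42;                                  // plain write
    atomic_thread_fence(memory_order_release);     // fence after writing
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

// consumer thread
int consume(void) {
    while (!atomic_load_explicit(&ready, memory_order_relaxed))
        ;                                          // spin (simplified)
    atomic_thread_fence(memory_order_acquire);     // fence before reading
    return payload;                                // sees 42
}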


Portability considerations

Because my high-level language needs to be cross-platform, I need the answer to work on:

  • Linux, MacOS, and optionally Android and iOS
    • using pthreads primitives to lock message queues: pthread_mutex_init and pthread_mutex_lock + pthread_mutex_unlock
  • Windows
    • using Critical Section Objects to lock message queues: InitializeCriticalSection, and EnterCriticalSection + LeaveCriticalSection

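To make the two APIs concrete, here is the kind of minimal wrapper I have in mind (an untested sketch; qlock_t and the function names are mine):

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION qlock_t;
static void qlock_init(qlock_t *l) { InitializeCriticalSection(l); }
static void qlock(qlock_t *l)      { EnterCriticalSection(l); }
static void qunlock(qlock_t *l)    { LeaveCriticalSection(l); }
#else
#include <pthread.h>
typedef pthread_mutex_t qlock_t;
static void qlock_init(qlock_t *l) { pthread_mutex_init(l, NULL); }
static void qlock(qlock_t *l)      { pthread_mutex_lock(l); }
static void qunlock(qlock_t *l)    { pthread_mutex_unlock(l); }
#endif
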
If it helps, I'm assuming the following architectures:

  • Intel/AMD PC architecture for Windows/Linux/MacOS(?).
  • unknown (ARM?) for iOS and Android

And using the following compilers (you can assume a "recent" version of all of them):

  • MSVC on Windows
  • clang on Linux
  • Xcode on MacOS/iOS
  • CodeWorks for Android on Android

I've only built on Windows so far, but when the app is done, I want to port it to the other platforms with minimal work. Therefore I'm trying to ensure cross-platform compatibility from the start.


Attempted Solution

Here is my assumed work-flow:

  1. Read all the messages from the queue until it's empty (blocking only if it was empty to begin with).
  2. Call some "memory fence" here?
  3. Read the message contents (the targets of the pointers in the messages) and process the messages.
    • If the message is a "request", it can be processed and new messages buffered as "replies".
    • If the message is a "reply", the message content of the original "request" can be freed (implicit request "ack").
    • If the message is a "reply" that itself contains a pointer to "reply content" (instead of an "inline reply"), then a "reply-ack" must be sent too.
  4. Call some "memory fence" here?
  5. Send all the buffered messages to the appropriate message queues.

Real code is too large to post. Here is simplified pseudocode (just enough to show how the shared memory is accessed), using a mutex as the message queues do:

static pointer p = null
static mutex m = ...
static condition cv = ...   // paired with m; used by signal()/wait()
static thread_A_buffer = malloc(...)

Thread-A:
  do:
    // Send pointer to data
    int index = findFreeIndex(thread_A_buffer)
    // Assume different value (not 42) every time
    thread_A_buffer[index] = 42
    // Call some "memory fence" here (after writing, before sending)?
    lock(m)
    p = &(thread_A_buffer[index])
    signal(cv)
    unlock(m)
    // wait for processing
    // in reality, would wait for a second signal...
    pointer p_a = null
    do:
      // sleep
      lock(m)
      p_a = p
      unlock(m)
    while (p_a != null)
    // Free data
    thread_A_buffer[index] = 0
    freeIndex(thread_A_buffer, index)
  while true

Thread-B:
  while true:
    // wait for data
    pointer p_b = null
    while (p_b == null)
      lock(m)
      wait(cv, m)  // atomically releases m while blocked, re-acquires on wake
      p_b = p
      unlock(m)
    // Call some "memory fence" here (after receiving, before reading)?
    // process data
    print *p_b
    // say we are done
    lock(m)
    p = null
    // in reality, would send a second signal...
    unlock(m)

Would this solution work? Reformulating the question: does Thread-B print "42", always, on all the considered platforms and OSes (pthreads and Windows CS)? Or do I need to add other threading primitives, such as memory fences?


Research

I've spent hours looking at many related SO questions and read some articles, but I'm still not totally sure. Based on @Art's comment, I probably don't need to do anything. I believe this is based on this statement from the POSIX standard, 4.12 Memory Synchronization:

[...] using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads.

My problem is that this sentence doesn't clearly specify whether it means "all the accessed memory" or "only the memory accessed between the lock and the unlock". I have read people arguing for both cases, and even some implying it was written imprecisely on purpose, to give compiler implementers more leeway in their implementations!

Furthermore, this applies to pthreads, but I need to know how it applies to Windows threading as well.

I'll accept any answer that, based on quotes/links from either standard documentation or some other highly reliable source, either proves that I don't need fences or shows which fences I need, under the aforementioned platform configurations, at least for the Windows/Linux/MacOS case. If Windows threads behave like pthreads in this case, I'd like a link/quote for that too.

The following are some (of the best) related questions/links I read, but the presence of conflicting information causes me to doubt my understanding.

  • Does pthread_mutex_lock contains memory fence instruction?
  • Memory Fences - Need help to understand
  • Problem with pThread sync issue
  • Memory Visibility Through pthread Library?
  • clarifications on full memory barriers involved by pthread mutexes
  • Memory model spec in pthreads
  • http://www.hpl.hp.com/techreports/2005/HPL-2005-217R1.html
  • http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11
  • https://msdn.microsoft.com/en-us/library/windows/desktop/ms684208(v=vs.85).aspx
Asked Nov 03 '17 by Sebastien Diot


1 Answer

My review of the C++11 documentation, and of the similar wording in C11 (n1570), led me to the following understanding.

Data is safely usable between threads if some form of co-operative synchronization is performed between them. If one thread adds items to a queue while holding a mutex, and a second thread reads items from that queue while holding the same mutex, then the memory readable in the second thread is exactly the memory that was written in the first thread.

That is because the compiler and the underlying CPU are not allowed to reorder side effects across that synchronization ordering.
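
A minimal sketch of that pattern using POSIX threads (names are mine, error handling omitted):

#include <pthread.h>

static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond  = PTHREAD_COND_INITIALIZER;
static void *q_item = NULL;    // one-slot "queue", for brevity

// producer: writes the payload first, then publishes it under the mutex
void q_put(void *item) {
    pthread_mutex_lock(&q_mutex);
    q_item = item;
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_mutex);
}

// consumer: everything written before q_put() is visible after q_get()
void *q_get(void) {
    pthread_mutex_lock(&q_mutex);
    while (q_item == NULL)
        pthread_cond_wait(&q_cond, &q_mutex);
    void *item = q_item;
    q_item = NULL;
    pthread_mutex_unlock(&q_mutex);
    return item;
}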

From n1570

An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, A is dependency-ordered before B, or, for some evaluation X:

— A synchronizes with X and X is sequenced before B,

— A is sequenced before X and X inter-thread happens before B, or

— A inter-thread happens before X and X inter-thread happens before B

So, to ensure that the memory visible in the other thread is consistent, either of the following guarantees the result (the second option is sketched below):

  • A mutex protecting both the producer's write and the consumer's read
  • An interlocked (atomic release) write at the producer paired with an interlocked (atomic acquire) read at the consumer

The interlocked write causes all the preceding operations in thread A to be ordered and made visible before thread B's read observes the written value.
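
The second option, expressed with C11 atomics (a sketch; on Windows, the Interlocked* family provides full-barrier semantics that are at least as strong):

#include <stdatomic.h>

static int data;                          // the "lent" memory
static _Atomic(int *) shared_p = NULL;

// producer (thread A)
void publish(void) {
    data = 42;                                       // plain write
    atomic_store_explicit(&shared_p, &data,
                          memory_order_release);     // interlocked write
}

// consumer (thread B)
int consume(void) {
    int *q;
    while ((q = atomic_load_explicit(&shared_p,
                    memory_order_acquire)) == NULL)  // interlocked read
        ;                                            // spin (simplified)
    return *q;                                       // guaranteed to see 42
}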

After the data has been written to the queue for "other-thread processing", the first thread can't safely modify or read any of the object's memory (unlocked) until it knows, through some mechanism, that the other thread is no longer accessing the data. It will only see correct results if that is done through some synchronization mechanism.

Both the C++ and C standards are intended to formalize the existing behavior of compilers and CPUs. So although there are fewer formal guarantees for the use of pthreads with the C99 standard, these would be expected to be consistent.

From your example

Thread A

int index = findFreeIndex(thread_A_buffer)

This line is problematic, as it doesn't show any synchronization primitives. If the mechanism behind findFreeIndex relies only on memory written by thread A, then this will work (see the sketch below). If thread B, or any other thread, modifies that memory, then a further lock is necessary.
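
For example, a findFreeIndex whose bookkeeping is touched only by thread A needs no lock (a hypothetical sketch, since the real allocator isn't shown):

#define SLOTS 128

// Bookkeeping private to thread A: no other thread reads or writes it,
// so no synchronization is needed here.
static int slot_used[SLOTS];

int findFreeIndex(void) {
    for (int i = 0; i < SLOTS; i++) {
        if (!slot_used[i]) {
            slot_used[i] = 1;
            return i;
        }
    }
    return -1;   // no free slot
}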

lock(m)
p = &(thread_A_buffer[index])
signal()
unlock(m)

This is covered by:

15 An evaluation A is dependency-ordered before an evaluation B if

— A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or

— for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.

and

18 An evaluation A happens before an evaluation B if A is sequenced before B or A interthread happens before B.

The operations before the synchronization "happen before" the synchronization, and are guaranteed to be visible after the synchronization in the other thread.

The lock (an acquire) and unlock (a release) ensure a strict ordering, with the information in thread A being completed and made visible to B.

thread_A_buffer[index] = 42;      // happens before 

At this point the write to thread_A_buffer is visible in A, but reading it from B would cause undefined behavior (a data race).

lock(m);  // acquire

Although the acquire is needed to pair with the release, I can't see any effect it produces here on its own.

p = &thread_A_buffer[index];
unlock(m);

All of A's instruction stream up to this point is now visible to B (due to its synchronization on m).

thread_A_buffer[index] = 42;  << This happens before and ...
p = &thread_A_buffer[index];  << carries a dependency into p
unlock(m);

Everything in A is now visible to B because:

An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, A is dependency-ordered before B, or, for some evaluation X

— A synchronizes with X and X is sequenced before B,

— A is sequenced before X and X inter-thread happens before B, or

— A inter-thread happens before X and X inter-thread happens before B.

pointer p_a = null
do:
  // sleep
  lock(m)
  p_a = p
  unlock(m)
while (p_a != null)

This code is completely safe: the value read into p_a is ordered with respect to the other thread, and becomes null after the synchronized write in thread B, which lets the loop exit. Again, the lock/unlock pair imposes a strict ordering which ensures the read value will be the written value.

All of thread B's interactions are within a lock, so again completely safe.

If A were to modify the object after handing it to B, it would not work unless there was some further synchronization (one possibility is sketched below).
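
One way to provide that further synchronization is the ack the question already proposes: B clears the shared pointer (or sends an ack message) with a release operation, and A reuses the memory only after observing that with an acquire. A sketch with C11 atomics:

#include <stdatomic.h>

static int data;
static _Atomic(int *) shared_p = NULL;

// thread A: lend the buffer, then reuse it only after B's ack
void lend_and_reuse(void) {
    data = 42;
    atomic_store_explicit(&shared_p, &data, memory_order_release);
    while (atomic_load_explicit(&shared_p, memory_order_acquire) != NULL)
        ;                        // wait for the ack (spin, simplified)
    data = 0;                    // safe: B is done with the memory
}

// thread B: read the lent memory, then ack with a release store
void receive_and_ack(void) {
    int *q;
    while ((q = atomic_load_explicit(&shared_p, memory_order_acquire)) == NULL)
        ;
    int value = *q;              // this read happens before the ack below
    (void)value;
    atomic_store_explicit(&shared_p, NULL, memory_order_release);   // ack
}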

Answered Nov 05 '22 by mksteve