Improve performance of reading volatile memory

Tags: performance, c, embedded, volatile, dma

I have a function reading from some volatile memory which is updated by a DMA. The DMA never operates on the same memory location as the function. My application is performance critical, and I noticed that the execution time improves by approx. 20% if I do not declare the memory as volatile. Within the scope of my function the memory is effectively non-volatile. However, I have to be sure that the next time the function is called, the compiler knows that the memory may have changed.

The memory is two two-dimensional arrays:

volatile uint16_t memoryBuffer[2][10][20] = {0};

The DMA always operates on the opposite "matrix" from the one the program function reads:

void myTask(uint8_t indexOppositeOfDMA)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      //Do some stuff with memory (readings only):
      foo(memoryBuffer[indexOppositeOfDMA][n][m]);
    }
  }
}

Is there a proper way to tell my compiler that memoryBuffer is non-volatile inside the scope of myTask() but may have changed the next time I call myTask(), so that I can obtain the performance improvement of 20%?

Platform Cortex-M4

asked Feb 09 '17 13:02

Dennis Kirkegaard

1 Answer

The problem without volatile

Let's assume that volatile is omitted from the data array. Then the C compiler and the CPU do not know that its elements change outside the program-flow. Some things that could happen then:

The compiler might read array elements once and keep them in registers, and on a CPU with a data cache the whole array might be loaded into the cache when myTask() is called for the first time and never be updated from the "main" memory again. Note that volatile by itself does not invalidate a hardware cache, and the Cortex-M4 core has no data cache of its own, so on this platform the compiler-side effects are the main concern. The caching issue is more pressing on multi-core CPUs if myTask() is bound to a single core, for example.
If myTask() is inlined into the parent function, the compiler might decide to hoist loads out of the loop, even to a point where the DMA transfer has not yet completed.
The compiler might even be able to determine that no write happens to memoryBuffer and assume that the array elements stay at 0 all the time (which would again trigger a lot of optimizations). This could happen if the program is rather small and all the code is visible to the compiler at once (or LTO is used). Remember: the compiler does not know anything about the DMA peripheral, which from the compiler's perspective is writing "unexpectedly and wildly into memory".

If the compiler is dumb/conservative and the CPU not very sophisticated (single core, no out-of-order execution), the code might even work without the volatile declaration. But it also might not...

The problem with volatile

Making the whole array volatile is often a pessimisation. For speed reasons you probably want to unroll the loop. So instead of alternating between loading from the array and incrementing the index, such as

load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;

it can be faster to load multiple elements at once and increment the index in larger steps such as

load memoryBuffer[m]
load memoryBuffer[m + 1]
load memoryBuffer[m + 2]
load memoryBuffer[m + 3]
m += 4;

This is especially true if the loads can be fused together (e.g. to perform one 32-bit load instead of two 16-bit loads). Furthermore, you want the compiler to use SIMD instructions to process multiple array elements with a single instruction.

These optimizations are often prevented if the load happens from volatile memory, because every volatile access must be performed exactly as written and compilers are usually very conservative with load/store reordering around volatile memory accesses. The exact behavior differs between compiler vendors (e.g. MSVC vs GCC).
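
To make the fused-load idea concrete, here is a minimal sketch in portable C of the transformation a compiler is free to apply to a plain (non-volatile) array. The function names sum_simple and sum_fused are made up for this illustration and do not come from the question. With a volatile array the compiler has to keep one 16-bit access per element, as in the first function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 16-bit load per element: the access pattern a volatile array forces. */
uint32_t sum_simple(const uint16_t *data, size_t count)
{
    uint32_t sum = 0;
    for (size_t m = 0; m < count; ++m)
        sum += data[m];
    return sum;
}

/* Two elements per 32-bit load. 'count' must be even.
 * memcpy is used for the 32-bit read so that the code does not depend on
 * alignment or break aliasing rules; compilers typically turn it into a
 * single load instruction. */
uint32_t sum_fused(const uint16_t *data, size_t count)
{
    uint32_t sum = 0;
    for (size_t m = 0; m < count; m += 2)
    {
        uint32_t pair;
        memcpy(&pair, &data[m], sizeof pair);
        /* Adding both halves gives the same result on little- and big-endian. */
        sum += (pair & 0xFFFFu) + (pair >> 16);
    }
    return sum;
}
```

Both functions return the same value for any even-length row; the second one simply issues half as many loads, which is the kind of rewrite that is off the table once the data is volatile.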

Possible solution 1: fences

So you would like to make the array non-volatile but add a hint for the compiler/CPU saying "when you see this line (execute this statement), flush the cache and reload the array from memory". In C11 you could insert an atomic_thread_fence at the beginning of myTask(). Such fences prevent the re-ordering of loads/stores across them.

Since we do not have a C11 compiler, we use intrinsics for this task. The ARMCC compiler has a __dmb() intrinsic (data memory barrier). For GCC you may want to look at __sync_synchronize().
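
If a C11 compiler is available, the fence variant could look roughly like the sketch below. It assumes that the caller only invokes myTask() after the DMA has finished with the selected half of the buffer. The foo() shown here and the checksum variable are placeholders standing in for the real processing. Strictly speaking, the C standard defines fences only in terms of atomic operations between threads, so using one as a barrier against DMA writes relies on the compiler treating it as a full memory barrier (GCC and Clang do in practice).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Plain, non-volatile buffer: the compiler may optimize reads inside myTask(). */
uint16_t memoryBuffer[2][10][20];

/* Placeholder for the real processing; it just accumulates a checksum. */
uint32_t checksum;
static void foo(uint16_t value)
{
    checksum += value;
}

void myTask(uint8_t indexOppositeOfDMA)
{
    /* Full fence: loads below must not be moved above this point, and
     * values read before it must not be reused after it. */
    atomic_thread_fence(memory_order_seq_cst);

    for (uint8_t n = 0; n < 10; n++)
    {
        for (uint8_t m = 0; m < 20; m++)
        {
            foo(memoryBuffer[indexOppositeOfDMA][n][m]);
        }
    }
}
```

Inside the loops the array is an ordinary array, so unrolling and load fusing remain possible; only the boundary of the function acts as a synchronization point.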

Possible solution 2: atomic variable holding the buffer state

We use the following pattern a lot in our codebase (e.g. when reading data from SPI via DMA and calling a function to analyze it): The buffer is declared as plain array (no volatile) and an atomic flag is added to each buffer, which is set when the DMA transfer has finished. The code looks something like this:

#include <stdint.h>

void foo(uint16_t value); // The processing function from the question.

typedef struct Buffer
{
    uint16_t data[10][20];
    // Flag indicating if the buffer has been filled. Only use atomic instructions on it!
    int filled;
    // C11: atomic_int filled;
    // C++: std::atomic_bool filled{false};
} Buffer_t;

Buffer_t buffers[2];

Buffer_t* volatile currentDmaBuffer; // using volatile here because I'm lazy

void setupDMA(void)
{
    for (int i = 0; i < 2; ++i)
    {
        int bufferFilled;
        // Atomically load the flag.
        bufferFilled = __sync_fetch_and_or(&buffers[i].filled, 0);
        // C11: bufferFilled = atomic_load(&buffers[i].filled);
        // C++: bufferFilled = buffers[i].filled;

        if (!bufferFilled)
        {
            currentDmaBuffer = &buffers[i];
            // ... configure DMA to write to buffers[i].data and start it

            // A free buffer was found, so do not fall through to the
            // "no free buffer" case below.
            return;
        }
    }

    // If you end up here, there is no free buffer available because the
    // data processing takes too long.
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    // Atomically set the flag indicating that the buffer has been filled.
    __sync_fetch_and_or(&currentDmaBuffer->filled, 1);
    // C11: atomic_store(&currentDmaBuffer->filled, 1);
    // C++: currentDmaBuffer->filled = true;

    currentDmaBuffer = 0;
    // ... possibly start another DMA transfer ...
}

void myTask(Buffer_t* buffer)
{
    for (uint8_t n=0; n<10; n++)
        for (uint8_t m=0; m<20; m++)
            foo(buffer->data[n][m]);

    // Reset the flag atomically.
    __sync_fetch_and_and(&buffer->filled, 0);
    // C11: atomic_store(&buffer->filled, 0);
    // C++: buffer->filled = false;
}

void waitForData(void)
{
    // ... see setupDMA(void) ...
}

The advantage of pairing the buffers with an atomic flag is that you are able to detect when the processing is too slow. In that case you have to buffer more, slow down the incoming data, make the processing code faster, or do whatever else is sufficient in your case.
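
For reference, here is a condensed sketch of the same flag pattern written with C11 atomics and explicit release/acquire ordering, with the hardware parts left out so that it can be compiled and tried on a PC. The helper names findFreeBuffer, markFilled and processBuffer are invented for this sketch; in a real system markFilled would be called from the DMA-complete interrupt, and the summing loop is only a stand-in for the actual processing.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Buffer
{
    uint16_t data[10][20];
    atomic_bool filled; /* true once the DMA has finished writing */
} Buffer_t;

static Buffer_t buffers[2];

/* Returns a buffer the DMA may write to, or NULL if both still hold
 * unprocessed data (i.e. the processing is too slow). */
Buffer_t *findFreeBuffer(void)
{
    for (size_t i = 0; i < 2; ++i)
    {
        if (!atomic_load_explicit(&buffers[i].filled, memory_order_acquire))
            return &buffers[i];
    }
    return NULL;
}

/* Release store: everything written to the buffer before this call is
 * visible to whoever later observes filled == true with an acquire load. */
void markFilled(Buffer_t *buffer)
{
    atomic_store_explicit(&buffer->filled, true, memory_order_release);
}

/* Sums the buffer if it is filled and hands it back to the DMA side.
 * Returns false (and leaves *sum untouched) if no data is ready. */
bool processBuffer(Buffer_t *buffer, uint32_t *sum)
{
    if (!atomic_load_explicit(&buffer->filled, memory_order_acquire))
        return false;

    uint32_t total = 0;
    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            total += buffer->data[n][m];
    *sum = total;

    atomic_store_explicit(&buffer->filled, false, memory_order_release);
    return true;
}
```

The data array itself stays non-volatile; the acquire load of the flag is the only point at which the compiler must assume that the buffer contents may have changed.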

Possible solution 3: OS support

If you have an (embedded) OS, you might resort to other patterns instead of using volatile arrays. The OS we use features memory pools and queues. The latter can be filled from a thread or an interrupt and a thread can block on the queue until it is non-empty. The pattern looks a bit like this:

MemoryPool pool;              // A pool to acquire DMA buffers.
Queue bufferQueue;            // A queue for pointers to buffers filled by the DMA.
void* volatile currentBuffer; // The buffer currently filled by the DMA.

void setupDMA(void)
{
    currentBuffer = MemoryPool_Allocate(&pool, 20 * 10 * sizeof(uint16_t));
    // ... make the DMA write to currentBuffer
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    Queue_Post(&bufferQueue, currentBuffer);
    currentBuffer = 0;
}

void myTask(void)
{
    void* buffer = Queue_Wait(&bufferQueue);
    // ... work with buffer ...
    MemoryPool_Deallocate(&pool, buffer);
}

This is probably the easiest approach to implement, but only if you have an OS and portability is not an issue.
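
To illustrate what such a queue does under the hood, here is a small sketch of a non-blocking pointer queue for exactly one producer (the interrupt) and one consumer (the task), built on C11 atomics. It is not the API of any particular OS: PointerQueue, PointerQueue_TryPost and PointerQueue_TryTake are names made up for this example, and a real RTOS queue would additionally put the waiting thread to sleep instead of returning NULL.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Must be a power of two so that the index arithmetic stays consistent
 * when the unsigned counters wrap around. */
#define QUEUE_CAPACITY 4u

typedef struct PointerQueue
{
    void *slots[QUEUE_CAPACITY];
    atomic_uint head; /* next slot to read, advanced only by the consumer */
    atomic_uint tail; /* next slot to write, advanced only by the producer */
} PointerQueue;

/* Producer side (e.g. the DMA-done interrupt). Returns false if full. */
bool PointerQueue_TryPost(PointerQueue *q, void *item)
{
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_CAPACITY)
        return false;

    q->slots[tail % QUEUE_CAPACITY] = item;
    /* Release: the slot content is published together with the new tail. */
    atomic_store_explicit(&q->tail, tail + 1u, memory_order_release);
    return true;
}

/* Consumer side (e.g. myTask). Returns NULL if the queue is empty. */
void *PointerQueue_TryTake(PointerQueue *q)
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return NULL;

    void *item = q->slots[head % QUEUE_CAPACITY];
    /* Release: the slot may be reused by the producer from now on. */
    atomic_store_explicit(&q->head, head + 1u, memory_order_release);
    return item;
}
```

A zero-initialized PointerQueue (for example one with static storage duration) starts out empty. As with the flag pattern, the buffers passed through the queue do not need to be volatile, because the acquire/release pair on the counters orders the accesses to their contents.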

answered Sep 21 '22 04:09

Mehrwolf
