Improve performance of reading volatile memory

I have a function that reads from some volatile memory which is updated by a DMA. The DMA never operates on the same memory location as the function. My application is performance critical. Hence, I realized the execution time improves by approx. 20% if I do not declare the memory as volatile. Within the scope of my function the memory is non-volatile. However, I have to be sure that the next time the function is called, the compiler knows that the memory may have changed.

The memory is two two-dimensional arrays:

volatile uint16_t memoryBuffer[2][10][20] = {0};

The DMA always operates on the opposite "matrix" to the one the function is reading:

void myTask(uint8_t indexOppositeOfDMA)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      //Do some stuff with memory (readings only):
      foo(memoryBuffer[indexOppositeOfDMA][n][m]);
    }
  }
}

Is there a proper way to tell my compiler that memoryBuffer is non-volatile inside the scope of myTask(), but may have changed by the next time I call myTask(), so I can obtain the performance improvement of 20%?

Platform: Cortex-M4

asked Feb 09 '17 by Dennis Kirkegaard



1 Answer

The problem without volatile

Let's assume that volatile is omitted from the data array. Then the C compiler and the CPU do not know that its elements change outside the program flow. Some things that could happen then:

  • The whole array might be loaded into the cache when myTask() is called for the first time. The array might stay in the cache forever and is never updated from the "main" memory again. This issue is more pressing on multi-core CPUs if myTask() is bound to a single core, for example.

  • If myTask() is inlined into the parent function, the compiler might decide to hoist loads outside of the loop even to a point where the DMA transfer has not been completed.

  • The compiler might even be able to determine that no write ever happens to memoryBuffer and assume that the array elements stay 0 all the time (which would again trigger a lot of optimizations); this case is sketched below. This could happen if the program is rather small and all the code is visible to the compiler at once (or LTO is used). Remember: after all, the compiler does not know anything about the DMA peripheral or that it is writing "unexpectedly and wildly" into memory (from the compiler's perspective).

If the compiler is dumb/conservative and the CPU not very sophisticated (single core, no out-of-order execution), the code might even work without the volatile declaration. But it also might not...
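To illustrate the last bullet: if memoryBuffer has internal linkage and the compiler can prove that no C code ever writes to it, the zero initializer may be propagated into every read. The following is a hypothetical but perfectly legal transformation (my illustration, not something from the question):

static uint16_t memoryBuffer[2][10][20] = {0}; // never written by any code the compiler can see

void myTask(uint8_t indexOppositeOfDMA)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      foo(0); // the compiler "knows" the array stays all zero
    }
  }
}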

The problem with volatile

Making the whole array volatile is often a pessimisation. For speed reasons you probably want to unroll the loop. So instead of loading from the array and incrementing the index alternately, such as

load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;

it can be faster to load multiple elements at once and increment the index in larger steps such as

load memoryBuffer[m]
load memoryBuffer[m + 1]
load memoryBuffer[m + 2]
load memoryBuffer[m + 3]
m += 4;

This is especially true if the loads can be fused together (e.g. performing one 32-bit load instead of two 16-bit loads). Further, you want the compiler to use SIMD instructions to process multiple array elements with a single instruction.

These optimizations are often prevented if the load happens from volatile memory, because compilers are usually very conservative with load/store reordering around volatile memory accesses. Again, the behavior differs between compiler vendors (e.g. MSVC vs. GCC).
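For illustration, here is a hand-written equivalent of such a fused access, something a compiler may generate on its own for a non-volatile buffer. This is a sketch: it assumes a little-endian target (true for the Cortex-M4 in its usual configuration) and uses memcpy to stay clear of strict-aliasing issues.

#include <stdint.h>
#include <string.h>

void foo(uint16_t value); // from the question

void processRow(const uint16_t *row) // one row of 20 elements
{
    for (int m = 0; m < 20; m += 2)
    {
        uint32_t pair;
        memcpy(&pair, &row[m], sizeof pair); // one 32-bit load instead of two 16-bit loads
        foo((uint16_t)(pair & 0xFFFFu));     // row[m]     (little-endian layout)
        foo((uint16_t)(pair >> 16));         // row[m + 1]
    }
}

With volatile, the compiler has to emit two separate 16-bit loads in source order, so this kind of fusion is off the table.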

Possible solution 1: fences

So you would like to make the array non-volatile but add a hint for the compiler/CPU saying "when you see this line (execute this statement), flush the cache and reload the array from memory". In C11 you could insert an atomic_thread_fence at the beginning of myTask(). Such fences prevent the re-ordering of loads/stores across them.

Since we do not have a C11 compiler, we use intrinsics for this task. The ARMCC compiler has a __dmb() intrinsic (data memory barrier). For GCC, you may want to look at __sync_synchronize().
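Putting it together, a minimal sketch for GCC could look like this (assuming the buffer is now declared without volatile; the barrier placement is my suggestion):

#include <stdint.h>

void foo(uint16_t value); // from the question

uint16_t memoryBuffer[2][10][20] = {0}; // no volatile anymore

void myTask(uint8_t indexOppositeOfDMA)
{
    // Full barrier at function entry: the compiler must not hoist the
    // loads below above this point, and on ARM a DMB instruction is
    // emitted so the CPU sees the DMA's writes.
    __sync_synchronize();

    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            foo(memoryBuffer[indexOppositeOfDMA][n][m]);
}

Inside the loops the compiler is now free to unroll, fuse and vectorize the loads, because nothing is volatile anymore.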

Possible solution 2: atomic variable holding the buffer state

We use the following pattern a lot in our codebase (e.g. when reading data from SPI via DMA and calling a function to analyze it): The buffer is declared as plain array (no volatile) and an atomic flag is added to each buffer, which is set when the DMA transfer has finished. The code looks something like this:

typedef struct Buffer
{
    uint16_t data[10][20];
    // Flag indicating if the buffer has been filled. Only use atomic instructions on it!
    int filled;
    // C11: atomic_int filled;
    // C++: std::atomic_bool filled{false};
} Buffer_t;

Buffer_t buffers[2];

Buffer_t* volatile currentDmaBuffer; // using volatile here because I'm lazy

void setupDMA(void)
{
    for (int i = 0; i < 2; ++i)
    {
        int bufferFilled;
        // Atomically load the flag.
        bufferFilled = __sync_fetch_and_or(&buffers[i].filled, 0);
        // C11: bufferFilled = atomic_load(&buffers[i].filled);
        // C++: bufferFilled = buffers[i].filled;

        if (!bufferFilled)
        {
            currentDmaBuffer = &buffers[i];
            // ... configure DMA to write to buffers[i].data and start it ...
            return; // found a free buffer; stop searching
        }
    }

    // If you end up here, there is no free buffer available because the
    // data processing takes too long.
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    // Atomically set the flag indicating that the buffer has been filled.
    __sync_fetch_and_or(&currentDmaBuffer->filled, 1);
    // C11: atomic_store(&currentDmaBuffer->filled, 1);
    // C++: currentDmaBuffer->filled = true;

    currentDmaBuffer = 0;
    // ... possibly start another DMA transfer ...
}

void myTask(Buffer_t* buffer)
{
    for (uint8_t n=0; n<10; n++)
        for (uint8_t m=0; m<20; m++)
            foo(buffer->data[n][m]);

    // Reset the flag atomically.
    __sync_fetch_and_and(&buffer->filled, 0);
    // C11: atomic_store(&buffer->filled, 0);
    // C++: buffer->filled = false;
}

void waitForData(void)
{
    // ... see setupDMA(void) ...
}

The advantage of pairing the buffers with an atomic flag is that you are able to detect when the processing is too slow, meaning that you have to buffer more, slow down the incoming data, or make the processing code faster, whatever is sufficient in your case.
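A minimal consumer loop on top of this pattern might look like the following sketch (the round-robin polling is my illustration; in practice you would rather block as in waitForData()):

void mainLoop(void)
{
    setupDMA(); // start the first transfer; the IRQ handler chains the rest
    for (;;)
    {
        for (int i = 0; i < 2; ++i)
        {
            // Atomically check whether the DMA has finished filling buffer i.
            if (__sync_fetch_and_or(&buffers[i].filled, 0))
                myTask(&buffers[i]); // processes the data and clears the flag
        }
    }
}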

Possible solution 3: OS support

If you have an (embedded) OS, you might resort to other patterns instead of using volatile arrays. The OS we use features memory pools and queues. The latter can be filled from a thread or an interrupt and a thread can block on the queue until it is non-empty. The pattern looks a bit like this:

MemoryPool pool;              // A pool to acquire DMA buffers.
Queue bufferQueue;            // A queue for pointers to buffers filled by the DMA.
void* volatile currentBuffer; // The buffer currently filled by the DMA.

void setupDMA(void)
{
    currentBuffer = MemoryPool_Allocate(&pool, 20 * 10 * sizeof(uint16_t));
    // ... make the DMA write to currentBuffer
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    Queue_Post(&bufferQueue, currentBuffer);
    currentBuffer = 0;
}

void myTask(void)
{
    void* buffer = Queue_Wait(&bufferQueue);
    // ... work with buffer ...
    MemoryPool_Deallocate(&pool, buffer);
}

This is probably the easiest approach to implement, but it only works if you have an OS and if portability is not an issue.
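For reference, on a Cortex-M4 running FreeRTOS the same pattern could look roughly like the sketch below. This is my example, the original answer does not name an OS; a static double buffer stands in for the MemoryPool.

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"

static uint16_t bufferPool[2][10][20];  // simple static pool instead of MemoryPool
static QueueHandle_t bufferQueue;       // holds pointers to buffers filled by the DMA
static void * volatile currentBuffer;   // the buffer currently filled by the DMA

void setupDMA(void)
{
    bufferQueue = xQueueCreate(2, sizeof(void *));
    currentBuffer = bufferPool[0];
    // ... make the DMA write to currentBuffer ...
}

void DMA_done_IRQHandler(void)
{
    BaseType_t woken = pdFALSE;
    void *filled = currentBuffer;
    // ... switch the DMA to the other half of bufferPool ...
    xQueueSendFromISR(bufferQueue, &filled, &woken); // post the filled buffer
    portYIELD_FROM_ISR(woken);
}

void myTask(void)
{
    void *buffer;
    // Blocks until the ISR posts a filled buffer.
    if (xQueueReceive(bufferQueue, &buffer, portMAX_DELAY) == pdTRUE)
    {
        // ... work with buffer ...
    }
}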

answered Sep 21 '22 by Mehrwolf