Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is accessing statically or dynamically allocated memory faster?

There are 2 ways of allocating global array in C:

  1. statically

    char data[65536];
    
  2. dynamically

    char *data;
    …
    data = (char*)malloc(65536);  /* or whatever size */
    

The question is, which method has better performance? And by how much?

As understand it, the first method should be faster.

Because with the second method, to access the array you have to dereference element's address each time it is accessed, like this:

  1. read the variable data which contains the pointer to the beginning of the array
  2. calculate the offset to specific element
  3. access the element

With the first method, the compiler hard-codes the address of the data variable into the code, first step is skipped, so we have:

  1. calculate the offset to specific element from fixed address defined at compile time
  2. access the element of the array

Each memory access is equivalent to about 40 CPU clock cycles, so , using dynamic allocation, specially for not frequent reads can have significant performance decrease vs static allocation because the data variable may be purged from the cache by some more frequently accessed variable. On the contrary , the cost of dereferencing statically allocated global variable is 0, because its address is already hard-coded in the code.

Is this correct?

like image 299
Nulik Avatar asked Jan 18 '16 05:01

Nulik


3 Answers

Calculating offsets is not a big performance issue. You have to consider how you will actually use the array in your code. You'll most likely write something like data[i] = x; and then no matter where data is stored, the program has to load a base address and calculate an offset.

The scenario where the compiler can hard code the address in case of the statically allocated array only happens when you write something like data[55] = x; which is probably a far less likely use case.

At any rate we are talking about a few CPU ticks here and there. It's not something you should go chasing by attempting manual optimization.

Each memory access is equivalent to about 40 CPU clock cycles

What!? What CPU is that? Some pre-ancient computer from 1960?

Regarding cache memory, those concerns may be more valid. It is possible that statically allocated memory utilizes data cache better, but that's just speculation and you'd have to have a very specific CPU in mind to have that discussion.


There is however a significant performance difference between static and dynamic allocation, and that is the allocation itself. For each call to malloc there is a call to the OS API, which in turn runs search function going through the heap and looking for for a free segment. The library also needs to keep track of the address to that segment internally, so that when you call free() it knows how much memory to release. Also, the more you call malloc/free, the more segmented the heap will become.

like image 27
Lundin Avatar answered Oct 19 '22 07:10

Lundin


One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. Herein, consider char data_s[65536] and char *data_p = malloc(65536)


If the access is sporadic the static/global will be slightly faster:

// slower because we must fetch data_p and then store
void
datasetp(int idx,char val)
{

    data_p[idx] = val;
}

// faster because we can store directly
void
datasets(int idx,char val)
{

    data_s[idx] = val;
}

Now, if we consider caching, datasetp and datasets will be about the same [after the first access], because the fetch of data_p will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.


However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p at the start of the loop and put it in a register:

void
datasetalls(char val)
{
    int idx;

    for (idx = 0;  idx < 65536;  ++idx)
        data_s[idx] = val;
}

void
datasetallp(char val)
{
    int idx;

    for (idx = 0;  idx < 65536;  ++idx)
        data_p[idx] = val;
}

void
datasetallp_optimized(char val)
{
    int idx;
    register char *reg;

    // the optimizer will generate the equivalent code to this
    reg = data_p;

    for (idx = 0;  idx < 65536;  ++idx)
        reg[idx] = val;
}

If the access is so sporadic that data_p gets evicted from the cache, then, the performance difference doesn't matter so much, because access to [either] array is infrequent. Thus, not a target for code tuning.

If such eviction occurs, the actual data array will, most likely, be evicted as well.

A much larger array might have more of an effect (e.g. if instead of 65536, we had 100000000, then mere traversal will evict data_p and by the time we reached the end of the array, the leftmost entries would already be evicted.

But, in that case, the extra fetch of data_p would be 0.000001% overhead.

So, it helps to either benchmark [or model] the particular use case/access pattern.


UPDATE:

Based on some further experimentation [triggered by a comment by Peter], the datasetallp function does not optimize to the equivalent of datasetallp_optimized for certain conditions, due to strict aliasing considerations.

Because the arrays are char [or unsigned char], the compiler generates a data_p fetch on each loop iteration. Note that if the arrays are not char (e.g. int), the optimization does occur and data_p is fetched only once, because char can alias anything but int is more limited.

If we change char *data_p to char *restrict data_p we get the optimized behavior. Adding restrict tells the compiler that data_p will not alias anything [even itself], so it's safe to optimize the fetch.


Personal note: While I understand this, to me, it seems goofy that without restrict, the compiler must assume that data_p can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p pointing to itself or that data_p is part of a struct:

// simplest
char *data_p = malloc(65536);
data_p = (void *) &data_p;

// closer to real world
struct mystruct {
    ...
    char *data_p;
    ...
};
struct mystruct mystruct;
mystruct.data_p = (void *) &mystruct;

These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should allow fetch optimization].


For myself, I always hand code the equivalent of datasetallp_optimized [sans register], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p only once".


Also, the multifetch problem does not occur when using data_p for input [because we're not modifying anything, aliasing is not a consideration]:

// does fetch of data_p once at loop start
int
datasumallp(void)
{
    int idx;
    int sum;

    sum = 0;
    for (idx = 0;  idx < 65536;  ++idx)
        sum += data_p[idx];

    return sum;
}

But, while it can be fairly common, "hardwiring" a primitive array manipulation function with an explicit array [either data_s or data_p] is often less useful than passing the array address as an argument.

Side note: clang would optimize some of the functions using data_s into memset calls, so, during experimentation, I modified the example code slightly to prevent this.

void
dataincallx(array_t *data,int val)
{
    int idx;

    for (idx = 0;  idx < 65536;  ++idx)
        data[idx] = val + idx;
}

This does not suffer from the multifetch problem. That is, dataincallx(data_s,17) and dataincallx(data_p,37) work about the same [with the initial extra data_p fetch]. This is more likely to be what one might use in general [better code reuse, etc].


So, the distinction between data_s and data_p becomes a bit more of a moot point. Coupled with judicious use of restrict [or using types other than char], the data_p fetch overhead can be minimized to the point where it isn't really that much of a consideration.

It now comes down more to architectural/design choices of choosing a fixed size array or dynamically allocating one. Others have pointed out the tradeoffs.

This is use case dependent.

If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.

However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better that each function references the global [either data_s or data_p] directly.

like image 85
Craig Estey Avatar answered Oct 19 '22 07:10

Craig Estey


I think that data locality is much more of an issue than computing the base address of the array. (I could imagine cases where accessing the pointer contents is extremely fast because it is in a register while the offset to the stack pointer or text segment is a compile time constant; accessing a register may be faster.)

But the real issue will be data locality, which is often a reason to be careful with dynamic memory in performance critical tight loops. If you have more dynamically allocated data which happens to be close to your array, chances are the memory will remain in the cache. If you have data scattered all over the RAM allocated at different times, you may have many cache misses accessing them. In that case it would be better to allocate them statically (or on the stack) next to each other, if possible.

like image 25
Peter - Reinstate Monica Avatar answered Oct 19 '22 07:10

Peter - Reinstate Monica