There are 2 ways of allocating global array in C:
statically
char data[65536];
dynamically
char *data;
…
data = (char*)malloc(65536); /* or whatever size */
The question is, which method has better performance? And by how much?
As understand it, the first method should be faster.
Because with the second method, to access the array you have to dereference element's address each time it is accessed, like this:
data
which contains the pointer to the beginning of the arrayWith the first method, the compiler hard-codes the address of the data
variable into the code, first step is skipped, so we have:
Each memory access is equivalent to about 40 CPU clock cycles, so , using dynamic allocation, specially for not frequent reads can have significant performance decrease vs static allocation because the data
variable may be purged from the cache by some more frequently accessed variable. On the contrary , the cost of dereferencing statically allocated global variable is 0, because its address is already hard-coded in the code.
Is this correct?
Calculating offsets is not a big performance issue. You have to consider how you will actually use the array in your code. You'll most likely write something like data[i] = x;
and then no matter where data
is stored, the program has to load a base address and calculate an offset.
The scenario where the compiler can hard code the address in case of the statically allocated array only happens when you write something like data[55] = x;
which is probably a far less likely use case.
At any rate we are talking about a few CPU ticks here and there. It's not something you should go chasing by attempting manual optimization.
Each memory access is equivalent to about 40 CPU clock cycles
What!? What CPU is that? Some pre-ancient computer from 1960?
Regarding cache memory, those concerns may be more valid. It is possible that statically allocated memory utilizes data cache better, but that's just speculation and you'd have to have a very specific CPU in mind to have that discussion.
There is however a significant performance difference between static and dynamic allocation, and that is the allocation itself. For each call to malloc
there is a call to the OS API, which in turn runs search function going through the heap and looking for for a free segment. The library also needs to keep track of the address to that segment internally, so that when you call free()
it knows how much memory to release. Also, the more you call malloc/free, the more segmented the heap will become.
One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. Herein, consider char data_s[65536]
and char *data_p = malloc(65536)
If the access is sporadic the static/global will be slightly faster:
// slower because we must fetch data_p and then store
void
datasetp(int idx,char val)
{
data_p[idx] = val;
}
// faster because we can store directly
void
datasets(int idx,char val)
{
data_s[idx] = val;
}
Now, if we consider caching, datasetp
and datasets
will be about the same [after the first access], because the fetch of data_p
will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.
However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p
at the start of the loop and put it in a register:
void
datasetalls(char val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data_s[idx] = val;
}
void
datasetallp(char val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data_p[idx] = val;
}
void
datasetallp_optimized(char val)
{
int idx;
register char *reg;
// the optimizer will generate the equivalent code to this
reg = data_p;
for (idx = 0; idx < 65536; ++idx)
reg[idx] = val;
}
If the access is so sporadic that data_p
gets evicted from the cache, then, the performance difference doesn't matter so much, because access to [either] array is infrequent. Thus, not a target for code tuning.
If such eviction occurs, the actual data array will, most likely, be evicted as well.
A much larger array might have more of an effect (e.g. if instead of 65536
, we had 100000000
, then mere traversal will evict data_p
and by the time we reached the end of the array, the leftmost entries would already be evicted.
But, in that case, the extra fetch of data_p
would be 0.000001% overhead.
So, it helps to either benchmark [or model] the particular use case/access pattern.
UPDATE:
Based on some further experimentation [triggered by a comment by Peter], the datasetallp
function does not optimize to the equivalent of datasetallp_optimized
for certain conditions, due to strict aliasing considerations.
Because the arrays are char
[or unsigned char
], the compiler generates a data_p
fetch on each loop iteration. Note that if the arrays are not char
(e.g. int
), the optimization does occur and data_p
is fetched only once, because char
can alias anything but int
is more limited.
If we change char *data_p
to char *restrict data_p
we get the optimized behavior. Adding restrict
tells the compiler that data_p
will not alias anything [even itself], so it's safe to optimize the fetch.
Personal note: While I understand this, to me, it seems goofy that without restrict
, the compiler must assume that data_p
can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p
pointing to itself or that data_p
is part of a struct:
// simplest
char *data_p = malloc(65536);
data_p = (void *) &data_p;
// closer to real world
struct mystruct {
...
char *data_p;
...
};
struct mystruct mystruct;
mystruct.data_p = (void *) &mystruct;
These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should allow fetch optimization].
For myself, I always hand code the equivalent of datasetallp_optimized
[sans register
], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p
only once".
Also, the multifetch problem does not occur when using data_p
for input [because we're not modifying anything, aliasing is not a consideration]:
// does fetch of data_p once at loop start
int
datasumallp(void)
{
int idx;
int sum;
sum = 0;
for (idx = 0; idx < 65536; ++idx)
sum += data_p[idx];
return sum;
}
But, while it can be fairly common, "hardwiring" a primitive array manipulation function with an explicit array [either data_s
or data_p
] is often less useful than passing the array address as an argument.
Side note: clang
would optimize some of the functions using data_s
into memset
calls, so, during experimentation, I modified the example code slightly to prevent this.
void
dataincallx(array_t *data,int val)
{
int idx;
for (idx = 0; idx < 65536; ++idx)
data[idx] = val + idx;
}
This does not suffer from the multifetch problem. That is, dataincallx(data_s,17)
and dataincallx(data_p,37)
work about the same [with the initial extra data_p
fetch]. This is more likely to be what one might use in general [better code reuse, etc].
So, the distinction between data_s
and data_p
becomes a bit more of a moot point. Coupled with judicious use of restrict
[or using types other than char
], the data_p
fetch overhead can be minimized to the point where it isn't really that much of a consideration.
It now comes down more to architectural/design choices of choosing a fixed size array or dynamically allocating one. Others have pointed out the tradeoffs.
This is use case dependent.
If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.
However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better that each function references the global [either data_s
or data_p
] directly.
I think that data locality is much more of an issue than computing the base address of the array. (I could imagine cases where accessing the pointer contents is extremely fast because it is in a register while the offset to the stack pointer or text segment is a compile time constant; accessing a register may be faster.)
But the real issue will be data locality, which is often a reason to be careful with dynamic memory in performance critical tight loops. If you have more dynamically allocated data which happens to be close to your array, chances are the memory will remain in the cache. If you have data scattered all over the RAM allocated at different times, you may have many cache misses accessing them. In that case it would be better to allocate them statically (or on the stack) next to each other, if possible.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With