I recently noticed that seemingly small changes in how a matrix is accessed in C can have a big impact on performance. For example, let's imagine we have these two fragments of C code. This one:
for (i = 0; i < 2048; i++)
{
    for (j = 0; j < 2048; j++) {
        Matrix[i][j] = 9999;
    }
}
And this one:
for (j = 0; j < 2048; j++)
{
    for (i = 0; i < 2048; i++) {
        Matrix[i][j] = 9999;
    }
}
The second version is 2x slower than the first. Why? I think it has to do with memory management: in each iteration, the first version accesses memory locations that sit next to each other, while the second version has to "jump" to a different region of memory on every access. Is this intuition right? Also, if I make the matrix small (64x64, for example), there is no difference in performance. Why? I would appreciate an intuitive and rigorous explanation. I am using Ubuntu 14.04 LTS, by the way.
for (i = 0; i < 2048; i++)
{
    for (j = 0; j < 2048; j++) {
        Matrix[i][j] = 9999;
    }
}
This form works with the CPU caches (L1, L2 and L3) rather than against them. When you loop with j over Matrix[i][j], the elements Matrix[i][0], Matrix[i][1], and so on sit at consecutive addresses (more precisely, at addresses differing by sizeof(Matrix[i][0])), so an access to Matrix[i][0] also pulls the following elements, such as Matrix[i][1], into the cache.
On the other hand, in

for (j = 0; j < 2048; j++)
{
    for (i = 0; i < 2048; i++) {
        Matrix[i][j] = 9999;
    }
}
the inner loop accesses the elements in the order Matrix[0][j], Matrix[1][j], and so on. The address of Matrix[1][j] is the address of Matrix[0][j] plus 2048 * sizeof(Matrix[0][0]), since each row Matrix[i] holds 2048 entries. So Matrix[0][j] lands in a different cache line than Matrix[1][j], and the fetch has to go out to RAM instead of being served from the cache. In this second case there can be a trip to RAM on every iteration.
"It's the cache! It's the cache!"
To visualise it, think of memory as a linear array. By defining a 2D array:
uint8_t Matrix[4][4]
you are simply saying: allocate 16 bytes, and access them as a 4x4 2D array.
This example assumes a 4-byte cache, to keep things simple. If the CPU's cache can only hold 4 bytes, then walking the array in the order [0][0], [1][0], [2][0], ... causes a cache miss on every access - requiring us to access the RAM (which is expensive) 16 times! Walking the array in the order [0][0], [0][1], [0][2], ... lets the whole 2D array be accessed with only 4 cache misses.
This example is very simplified - modern systems will almost definitely have an L1 and L2 cache, and many now implement an L3 cache too. As you get further from the processor core, the memory gets larger and slower: registers, then L1, L2 and L3 cache, then main RAM.