 

Matrix computations in C

Tags: c, memory, matrix

I recently noticed that seemingly small changes in how a matrix is accessed in C can have a big impact on performance. For example, let's imagine we have these two fragments of C code. This one:

for(i = 0; i < 2048; i++)
{
    for(j = 0; j < 2048; j++) {
            Matrix[i][j] = 9999;    
    }
}

And this one:

for(j = 0; j < 2048; j++)
{
    for(i = 0; i < 2048; i++) {
            Matrix[i][j] = 9999;    
    }
}

The second version is 2x slower than the first version. Why? I think it has to do with memory management: in each loop, the first version accesses positions in memory that are located next to each other, while the second version has to "jump" to a different region on every iteration. Is this intuition right? Also, if I make the matrix small (64x64, for example), there is no difference in performance. Why? I would appreciate it if someone could provide an intuitive and rigorous explanation. I am using Ubuntu 14.04 LTS, by the way.

Peter asked Dec 13 '22

2 Answers

for(i = 0; i < 2048; i++)
{
    for(j = 0; j < 2048; j++) {
        Matrix[i][j] = 9999;
    }
}

This form makes good use of the cache. When you loop with j over Matrix[i][j], the elements Matrix[i][0], Matrix[i][1], and so on are stored at consecutive addresses (differing by exactly sizeof(Matrix[i][0])), so an access to Matrix[i][0] pulls a whole cache line into the cache, and with it the next elements Matrix[i][1], Matrix[i][2], ...

On the other hand,

for(j = 0; j < 2048; j++)
{
    for(i = 0; i < 2048; i++) {
        Matrix[i][j] = 9999;
    }
}

the inner loop accesses elements in the order Matrix[0][j], Matrix[1][j], and so on. The address of Matrix[1][j] is the address of Matrix[0][j] plus 2048 * sizeof(Matrix[0][0]), since each row of Matrix holds 2048 entries.

So Matrix[0][j] lands in a different cache line than Matrix[1][j], forcing the fetch to go out to RAM instead of being served from the cache.

In this second case, nearly every iteration pays the cost of a RAM access.

alinsoar answered Dec 28 '22


"It's the cache! It's the cache!"

To visualise it, think about memory as a linear array...

By defining a 2D array:

uint8_t Matrix[4][4];

You are simply saying:

allocate 16 bytes, and access them as a 2D array, 4x4

This example assumes a 4-byte cache, to make things simple:

[figure: the 4x4 array laid out as 16 consecutive bytes in memory]

If the CPU's cache is only able to hold 4 bytes, then approaching the array in the [0][0], [1][0], [2][0], ... form will cause a cache miss on every access - requiring us to access the RAM (which is expensive) 16 times!

Approaching the array in the [0][0], [0][1], [0][2], ... form will permit the 2D array to be accessed, in full, with only 4 cache misses.


This example is very simplified - modern systems will almost definitely have an L1 and L2 cache, and many now implement an L3 cache too.

As you get further from the processor core, the memory gets larger and slower. For example:

  1. L1 cache (small, very very fast)
  2. L2 cache
  3. L3 cache(?)
  4. RAM
  5. Persistent storage (e.g: HDD - huge, but comparatively very, very slow)
Attie answered Dec 28 '22