 

C/C++ optimize data structures, array of arrays or just array

Tags:

c++

arrays

c

I'm working with a program that uses 16-byte 4x4 one-byte matrices:

unsigned char matrix[4][4];

and a few 256-byte 16x16 one-byte matrices:

unsigned char bigMatrix[16][16];

Very often, because of the way the data is manipulated, I am forced to loop over these matrices column-wise, which causes cache misses.

Will the performance improve if I use a flat array instead, i.e.

unsigned char matrix[16];
unsigned char bigMatrix[256];

and access the elements by computing the index from a couple of variables, i.e.

matrix[variableA*variableB + i];

where variableA*variableB+i needs to be recalculated every time I want to access an element.
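
In other words, every access would go through a small index computation, something like this sketch (the helper name and parameters are just for illustration):

unsigned char get(const unsigned char *m, int row, int col, int width)
{
    /* flat index recomputed on every access */
    return m[row * width + col];
}

For the 256-byte matrix I'd call it as get(bigMatrix, row, col, 16).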

I only care about speed optimization, and memory is not a problem. Will this help, as in give some performance gain or loss, or is the difference too small to even care about?

Milan asked Nov 27 '22

2 Answers

It makes no difference. The data is laid out in the exact same way in either case, and is accessed in the same way too. I'd be surprised if it didn't generate exactly the same assembly, even.
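
A minimal sketch to convince yourself (assuming the usual contiguous layout, which the language guarantees for arrays of arrays):

#include <cassert>
#include <cstdio>

unsigned char bigMatrix[16][16];

int main()
{
    /* The 2-D array is one contiguous 256-byte block... */
    static_assert(sizeof(bigMatrix) == 256, "16x16 bytes, no padding");

    /* ...and indexing it is the same pointer arithmetic you would
       write by hand for a flat array. */
    unsigned char *flat = &bigMatrix[0][0];
    int row = 5, col = 9;
    assert(&bigMatrix[row][col] == &flat[row * 16 + col]);
    std::puts("same address either way");
    return 0;
}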

However, with a 256-byte table you're unlikely to get cache misses anyway: the CPU's L1 cache is typically between 32 and 128 KB, so I doubt cache misses are much of a problem here in the first place.

jalf answered Dec 18 '22


jalf is mostly right. The L1 cache is divided up into chunks; the size of a chunk depends on the processor, but it is on the order of 32 bytes. So, if you were stepping through memory a byte at a time, you'd get a cache miss every 32 bytes (or whatever the chunk size is). Now, Intel chips are quite clever here: they can detect sequential reads and prefetch the data, reducing the impact of a cache miss.
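
If you want to check the actual line size on your machine, glibc on Linux lets you query it at run time; a minimal, Linux-specific sketch:

#include <unistd.h>
#include <cstdio>

int main()
{
    /* L1 data cache line size in bytes (may return 0 if unknown). */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 cache line: %ld bytes\n", line);
    return 0;
}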

The 4x4 matrix is highly likely to reside in a single L1 chunk (or cache line), so accessing it by row or by column makes little difference. Of course, you don't want the matrix split across two cache lines, so well-aligned memory is important.
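
With C++11 or later you can request that alignment directly; a minimal sketch, assuming a typical 64-byte line size:

/* Keep the whole 16-byte matrix inside one cache line by aligning it
   to a 64-byte boundary (64 bytes is a common line size; adjust for
   your target CPU). */
alignas(64) unsigned char matrix[4][4];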

The 16x16 matrix, however, won't fit in a single cache line. So, if you're skipping through the array processing columns, you're going to get lots of cache misses. The cost of computing the index, as jalf said, makes little difference, because the ratio between CPU speed and memory speed is high (i.e. you can do a lot of CPU work for every cache miss).
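
To make the two access patterns concrete, here is a minimal sketch of both traversal orders; the row-wise loop walks memory sequentially, while the column-wise one jumps 16 bytes on every access:

unsigned char bigMatrix[16][16];

/* Row-wise: consecutive bytes, easy for the prefetcher. */
unsigned int sumRowWise(void)
{
    unsigned int sum = 0;
    for (int row = 0; row < 16; ++row)
        for (int col = 0; col < 16; ++col)
            sum += bigMatrix[row][col];
    return sum;
}

/* Column-wise: every access jumps 16 bytes ahead, so the loop hops
   between cache lines instead of draining one line at a time. */
unsigned int sumColWise(void)
{
    unsigned int sum = 0;
    for (int col = 0; col < 16; ++col)
        for (int row = 0; row < 16; ++row)
            sum += bigMatrix[row][col];
    return sum;
}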

Now, if you are mainly processing the matrix in a column-oriented way, then your best option is to transpose all your matrices (swap rows with columns). Your memory accesses will then be more sequential, the number of cache misses will be reduced, and the CPU will be able to prefetch data better (there's a small transpose sketch after the layouts below). So, instead of organising the matrix like this:

  0   1   2 .... 15
 16  17  18 .... 31
....
240 241 242 .... 255

where the number is the memory offset from the start of the matrix, organise thus:

 0 16 32 ... 240
 1 17 33 ... 241
 ...
15 31 47 ... 255
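
For the square 16x16 case, the transpose mentioned above can be a simple in-place swap; a minimal sketch:

#include <utility>

/* Swap rows with columns in place so that what used to be a column
   walk becomes a sequential, cache-friendly row walk. */
void transpose16x16(unsigned char m[16][16])
{
    for (int row = 0; row < 16; ++row)
        for (int col = row + 1; col < 16; ++col)
            std::swap(m[row][col], m[col][row]);
}

After transposing, code that used to walk a column now walks a row of consecutive bytes.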
Skizz answered Dec 18 '22