Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

compiler nested loop optimization for sequential memory access.

I came across a strange performance issue in a matrix multiply benchmark (matrix_mult in Metis from the MOSBENCH suite). The benchmark was optimized to tile the data such that the active working set was 12kb (3 tiles of 32x32 ints) and would fit into the L1 cache. To make a long story short, swapping the inner two most loops had a performance difference of almost 4x on certain array input sizes (4096, 8192) and about a 30% difference on others. The problem essentially came down to accessing elements sequentially versus in a stride pattern. Certain array sizes I think created a bad stride access that generated a lot cache line collisions. The performance difference is noticeably less when changing from 2-way associative L1 to an 8-way associative L1.

My question is why doesn't gcc optimize loop ordering to maximize sequential memory accesses?

Below is a simplified version of the problem (note that performance times are highly dependent on L1 configuration. The numbers indicated below are from 2.3 GHZ AMD system with 64K L1 2-way associative compiled with -O3).

N = ARRAY_SIZE // 1024
int* mat_A = (int*)malloc(N*N*sizeof(int));
int* mat_B = (int*)malloc(N*N*sizeof(int));
int* mat_C = (int*)malloc(N*N*sizeof(int));

// Elements of mat_B are accessed in a stride pattern of length N
// This takes 800 msec  
for (int t = 0; t < 1000; t++) 
   for (int a = 0; a < 32; a++) 
      for (int b = 0; b < 32; b++)
         for (int c = 0; c < 32; c++) 
            mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];

// Inner two loops are swapped
// Elements are now accessed sequentially in inner loop
// This takes 172 msec  
for (int t = 0; t < 1000; t++) 
   for (int a = 0; a < 32; a++) 
      for (int c = 0; c < 32; c++) 
         for (int b = 0; b < 32; b++)
            mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
like image 856
Mark Avatar asked Nov 13 '22 09:11

Mark


1 Answers

  1. gcc might not be able to prove that the pointers don't overlap. If you are fine using non standard extensions you could try using __restrict.
  2. gcc doesn't take full advantage of your architecture to avoid the necessity to recompile for every processor. Using the option -march with the appropriate value for your system might help.
like image 101
taurdir Avatar answered Feb 15 '23 23:02

taurdir