Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

L2 data and instruction cache decreased suddenly

I am working on performance of paralle algorithm on Multicore machine. I did an experiment on Matrix Multiplication with loop reordering (ikj) technique.

The serial execution result is as in images below.L1 data cache hit for loop order ikj and kij for all size of nXn matrix is near 100% (Image 1 box number 1 & 2) and as you can see loop order ikj in size 2048 and 4096 suddenly L2 data cach hit decrease by %50 (Image 2 box number 1 & 2) also in L2 instruction cache hit the same is true. Incase where L1 data cache hit for these 2 size are like other sizes(256,512,1024) is about %100. I could not find any resonable reason for this slope in both Instruction and data cache hit. could any one give me clue on how to find the reason(s)?

do you think that L2 unified cache has any effect on exacerbating the problem? But still what causes this reduction what characteristic of algorithm and performance should I profile to find reason.

experimental machine is Intel e4500 with 2Mb L2 cache,cache line 64, os is fedora 17 x64 with gcc 4.7 -o no compiler optimization

Abridged & Complete Question? my problem is that why sudden decrease of about 50% in both L2 data and instruction cache happens in only ikj & kij algorithm as it's boxed and numbered 1 & 2 in images, but not in other loop variation?

                                   *Image 1*

enter image description here

                                    *Image 2*

enter image description here

                                    *Image 3*

enter image description here

                                   *Image 4*

enter image description here

                                   *Image 5*

Despite the aformentioned problem there is no increase in timing of ikj&kij algorithm. But also is faster than others. enter image description here

ikj and kij algorithm are two variation of loop reordering technique/

kij Algorithm

   For (k=0;k<n;k++)
    For(i=0;i<n;i++){
        r=A[i][k];
      For (j=0;j<n;j++)
          C[i][j]+=r*B[k][j] 
    }

ikj Algorithm

For (i=0;i<n;i++)
     For(k=0;k<n;k++){
      r=A[i][k];
      For (j=0;j<n;j++)
           C[i][j]+=r*B[k][j] 
    }   

thanks

like image 888
mjr Avatar asked Oct 12 '13 13:10

mjr


1 Answers

I bet this happens because of the super-alignment issue discussed in the answer of following questions:

  • Why is my program slow when looping over exactly 8192 elements?
  • Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
  • Matrix multiplication: Small difference in matrix size, large difference in timings

I hope it is understandable that I don't like to copy&paste from those answers.

like image 75
villekulla Avatar answered Nov 05 '22 07:11

villekulla