Matrix multiplication: Small difference in matrix size, large difference in timings

Tags:

performance

c

algorithm

matrix-multiplication

I have a matrix multiply code that looks like this:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Here, the size of the matrix is represented by dimension. Now, if the size of the matrices is 2000, it takes 147 seconds to run this piece of code, whereas if the size of the matrices is 2048, it takes 447 seconds. So while the ratio of the number of multiplications is (2048*2048*2048)/(2000*2000*2000) = 1.073, the ratio of the timings is 447/147, which is about 3. Can someone explain why this happens? I expected the run time to scale with the number of multiplications, which it does not. I am not trying to make the fastest matrix multiply code, simply trying to understand why it happens.

Specs: AMD Opteron dual-core node (2.2 GHz), 2 GB RAM, gcc 4.5.0

Program compiled as gcc -O3 simple.c

I have run this on Intel's icc compiler as well, and seen similar results.

EDIT:

As suggested in the comments/answers, I ran the code with dimension=2060 and it takes 145 seconds.

Here's the complete program:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

/* change dimension size as needed */
const int dimension = 2048;
struct timeval tv;

double timestamp()
{
        double t;
        gettimeofday(&tv, NULL);
        t = tv.tv_sec + (tv.tv_usec/1000000.0);
        return t;
}

int main(int argc, char *argv[])
{
        int i, j, k;
        double *A, *B, *C, start, end;

        A = (double*)malloc(dimension*dimension*sizeof(double));
        B = (double*)malloc(dimension*dimension*sizeof(double));
        C = (double*)malloc(dimension*dimension*sizeof(double));

        srand(292);

        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                {
                        A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        C[dimension*i+j] = 0.0;
                }

        start = timestamp();
        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                        for(k = 0; k < dimension; k++)
                                C[dimension*i+j] += A[dimension*i+k] *
                                        B[dimension*k+j];

        end = timestamp();
        printf("\nsecs:%f\n", end-start);

        free(A);
        free(B);
        free(C);

        return 0;
}

592

asked Oct 26 '11 16:10

jitihsk

1 Answer

Here's my wild guess: cache

It could be that you can fit 2 rows of 2000 doubles into the cache. That is 32,000 bytes, which is slightly less than the 32 KB L1 cache (while leaving room for other necessary things).

But when you bump it up to 2048, the two rows use the entire cache (and you spill some because you need room for other things).

Assuming the cache policy is LRU, spilling the cache just a tiny bit will cause the entire row to be repeatedly flushed and reloaded into the L1 cache.

The other possibility is conflicts from cache associativity caused by the power-of-two size. I think that processor's L1 is 2-way associative, though, so I don't think it matters in this case. (But I'll throw the idea out there anyway.)

Possible Explanation 2: Conflict cache misses due to super-alignment on the L2 cache.

Your B array is being iterated down a column, so the access is strided. Each matrix is 2k x 2k doubles, which is about 32 MB. That's much larger than your L2 cache.

When the data is not aligned perfectly, you will have decent spatial locality on B. Although you are hopping rows and only using one element per cacheline, the cacheline stays in the L2 cache to be reused by the next iteration of the middle loop.

However, when the data is aligned perfectly (2048), these hops will all land on the same "cache way" and will far exceed your L2 cache associativity. Therefore, the accessed cache lines of B will not stay in cache for the next iteration. Instead, they will need to be pulled in all the way from RAM.

168

answered Sep 17 '22 03:09

Mysticial
