
Cache utilization in matrix transpose in C

This code transposes a matrix four ways. The first does sequential writes and non-sequential reads; the second does the opposite. The next two repeat these, but with writes that bypass the cache (streaming stores). The results show that sequential writes are faster, and that bypassing the cache is faster. What I don't understand is: if the cache is being skipped, why are sequential writes still faster?

/* Transpose 1: sequential writes to tmp, strided reads from mul2 */
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
   for (j = 0; j < N; ++j)
      tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 1:\t%lld\n", after.QuadPart - before.QuadPart);

/* Transpose 2: strided writes to tmp, sequential reads from mul2 */
QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
   for (i = 0; i < N; ++i)
      tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 2:\t%lld\n", after.QuadPart - before.QuadPart);

/* Transpose 3: like 1, but the writes bypass the cache (streaming stores) */
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
   for (j = 0; j < N; ++j)
      _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 3:\t%lld\n", after.QuadPart - before.QuadPart);

/* Transpose 4: like 2, but the writes bypass the cache */
QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
   for (i = 0; i < N; ++i)
      _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 4:\t%lld\n", after.QuadPart - before.QuadPart);

EDIT: The output is

Transpose 1:    47603
Transpose 2:    92449
Transpose 3:    38340
Transpose 4:    69597
Dan asked Oct 10 '22

1 Answer

The CPU has write-combining buffers that merge stores to the same cache line so they reach memory in a single burst. In this case (cache bypassed, sequential writes), a write-combining buffer acts as a one-line cache, which makes the results very similar to those with the cache not bypassed. With strided (column-order) writes, each store lands in a different line, so the few available write-combining buffers are evicted before they fill, resulting in slower partial-line writes.

To be exact, even when the cache is bypassed, writes still happen in bursts to memory.

See the description of write-combining behavior in Intel's optimization manual.

Kamyar Souri answered Oct 13 '22