Cache utilization in matrix transpose in c

Question

This code transposes a matrix four ways. The first does sequential writes, non sequential reads. The second is the opposite. The next two are the same, but with cache skipping writes. What seems to happen is sequential writes are faster, and skipping the cache is faster. What I don't understand is, if the cache is being skipped why are sequential writes still faster?

QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
   for (j = 0; j < N; ++j)
      tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 1:	%ld
", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
   for (i = 0; i < N; ++i)
     tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 2:	%ld
", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
   for (j = 0; j < N; ++j)
      _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 3:	%ld
", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
   for (i = 0; i < N; ++i)
      _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 4:	%ld
", after.QuadPart - before.QuadPart);

EDIT: The output is

Transpose 1:    47603
Transpose 2:    92449
Transpose 3:    38340
Transpose 4:    69597

Kamyar Souri · Accepted Answer

CPU has a write combining buffer to combine writes on a cache line to happen in one burst. In this case (cache being skipped for sequential writes), this write combining buffer acts as a one line cache which makes the results be very similar to cache not being skipped.

To be exact, in case of cache being skipped, writes are still happening in bursts to memory.

See write-combining logic behavior here.

To be exact, in case of cache being skipped, writes are still happening in bursts to memory.

See write-combining logic behavior here.

Cache utilization in matrix transpose in c

Tags:

c

caching

computer-architecture

Dan

1 Answers

Kamyar Souri

Recent Activity

Donate For Us

Cache utilization in matrix transpose in c

Tags:

c

caching

computer-architecture

Dan

1 Answers

Kamyar Souri

Related questions

Recent Activity

Donate For Us