This code transposes a matrix four ways. The first does sequential writes, non sequential reads. The second is the opposite. The next two are the same, but with cache skipping writes. What seems to happen is sequential writes are faster, and skipping the cache is faster. What I don't understand is, if the cache is being skipped why are sequential writes still faster?
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 1:\t%ld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
for (i = 0; i < N; ++i)
tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 2:\t%ld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
_mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 3:\t%ld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
for (i = 0; i < N; ++i)
_mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 4:\t%ld\n", after.QuadPart - before.QuadPart);
EDIT: The output is
Transpose 1: 47603
Transpose 2: 92449
Transpose 3: 38340
Transpose 4: 69597
CPU has a write combining buffer to combine writes on a cache line to happen in one burst. In this case (cache being skipped for sequential writes), this write combining buffer acts as a one line cache which makes the results be very similar to cache not being skipped.
To be exact, in case of cache being skipped, writes are still happening in bursts to memory.
See write-combining logic behavior here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With