The following example code generates an N×N matrix and transposes it SAMPLES times. When N = 512, the average execution time of the transposition operation is 2144 μs (coliru link).
At first glance there is nothing special, right?
Well, here are the results for other values of N:

N = 513 → 1451 μs
N = 519 → 600 μs
N = 530 → 486 μs
N = 540 → 492 μs (finally, theory starts working :)

So why are these simple calculations so different from theory in practice? Does this behavior relate to CPU cache coherence or cache misses? If so, please explain.
#include <algorithm>
#include <iostream>
#include <chrono>

constexpr int N = 512;       // Why is 512 specifically slower? (as of 2016)
constexpr int SAMPLES = 1000;

using us = std::chrono::microseconds;

int A[N][N];

void transpose()
{
    for ( int i = 0 ; i < N ; i++ )
        for ( int j = 0 ; j < i ; j++ )
            std::swap(A[i][j], A[j][i]);
}

int main()
{
    // initialize matrix
    for ( int i = 0 ; i < N ; i++ )
        for ( int j = 0 ; j < N ; j++ )
            A[i][j] = i + j;

    auto t1 = std::chrono::system_clock::now();
    for ( int i = 0 ; i < SAMPLES ; i++ )
        transpose();
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Average for size " << N << ": "
              << std::chrono::duration_cast<us>(t2 - t1).count() / SAMPLES
              << " (us)\n";
}
Another common operation applied to a matrix is the transpose, written Aᵀ in mathematical notation. The transpose is defined for matrices of any size: it flips all elements along the main diagonal, turning rows into columns and columns into rows. For instance, a 4×3 matrix becomes a 3×4 matrix. The transpose is especially useful in applications where the inverse and adjoint of a matrix are needed.
If A is an m×n matrix and Aᵀ is its transpose, multiplying them yields two square matrices: A Aᵀ is m×m and Aᵀ A is n×n. Furthermore, both products are symmetric; the product of any matrix (square or rectangular) with its transpose is always symmetric.
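As a quick illustration of these properties (a minimal sketch, separate from the benchmark above), the following builds a 4×3 matrix, forms its 3×4 transpose, and verifies that A·Aᵀ comes out symmetric:

#include <iostream>

int main()
{
    constexpr int M = 4, K = 3;   // A is 4x3, so A^T is 3x4

    int A[M][K], At[K][M];

    // fill A with arbitrary values
    for ( int i = 0 ; i < M ; i++ )
        for ( int j = 0 ; j < K ; j++ )
            A[i][j] = i * K + j;

    // transpose: rows become columns
    for ( int i = 0 ; i < M ; i++ )
        for ( int j = 0 ; j < K ; j++ )
            At[j][i] = A[i][j];

    // P = A * A^T is an M x M matrix
    int P[M][M] = {};
    for ( int i = 0 ; i < M ; i++ )
        for ( int j = 0 ; j < M ; j++ )
            for ( int k = 0 ; k < K ; k++ )
                P[i][j] += A[i][k] * At[k][j];

    // check symmetry: P[i][j] == P[j][i] for all i, j
    bool symmetric = true;
    for ( int i = 0 ; i < M ; i++ )
        for ( int j = 0 ; j < M ; j++ )
            if ( P[i][j] != P[j][i] )
                symmetric = false;

    std::cout << (symmetric ? "A*A^T is symmetric" : "not symmetric") << "\n";
}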
This is due to cache misses. You can use valgrind --tool=cachegrind (for example, valgrind --tool=cachegrind ./transpose) to see the number of misses. Running your code with N = 512, I got the following output:
Average for size 512: 13052 (us)
==21803==
==21803== I refs: 1,054,721,935
==21803== I1 misses: 1,640
==21803== LLi misses: 1,550
==21803== I1 miss rate: 0.00%
==21803== LLi miss rate: 0.00%
==21803==
==21803== D refs: 524,278,606 (262,185,156 rd + 262,093,450 wr)
==21803== D1 misses: 139,388,226 (139,369,492 rd + 18,734 wr)
==21803== LLd misses: 25,828 ( 7,959 rd + 17,869 wr)
==21803== D1 miss rate: 26.6% ( 53.2% + 0.0% )
==21803== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==21803==
==21803== LL refs: 139,389,866 (139,371,132 rd + 18,734 wr)
==21803== LL misses: 27,378 ( 9,509 rd + 17,869 wr)
==21803== LL miss rate: 0.0% ( 0.0% + 0.0% )
While, using N = 530, I got the following output:
Average for size 530: 13264 (us)
==22783==
==22783== I refs: 1,129,929,859
==22783== I1 misses: 1,640
==22783== LLi misses: 1,550
==22783== I1 miss rate: 0.00%
==22783== LLi miss rate: 0.00%
==22783==
==22783== D refs: 561,773,362 (280,923,156 rd + 280,850,206 wr)
==22783== D1 misses: 32,899,398 ( 32,879,492 rd + 19,906 wr)
==22783== LLd misses: 26,999 ( 7,958 rd + 19,041 wr)
==22783== D1 miss rate: 5.9% ( 11.7% + 0.0% )
==22783== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==22783==
==22783== LL refs: 32,901,038 ( 32,881,132 rd + 19,906 wr)
==22783== LL misses: 28,549 ( 9,508 rd + 19,041 wr)
==22783== LL miss rate: 0.0% ( 0.0% + 0.0% )
As you can see, the number of D1 misses with N = 512 (139,388,226) is roughly four times higher than with N = 530 (32,899,398), and the D1 miss rate drops from 26.6% to 5.9%, even though the 530 run touches more data. The culprit is the power-of-two row stride: with 4-byte ints, a 512-element row is exactly 2048 bytes, so the column accesses A[j][i] in the inner loop keep mapping to the same few sets of the set-associative L1 data cache and evict each other (conflict misses). With N = 530 the stride no longer aligns with the cache geometry, the column walk spreads across many sets, and most accesses hit.
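One common mitigation (a sketch of my own, not part of the question's code) is to pad each row so that consecutive rows no longer map to the same cache sets; the stride between elements of a column stops being a large power of two while the logical matrix stays N×N:

#include <algorithm>

constexpr int N   = 512;
constexpr int PAD = 1;      // one extra column per row breaks the power-of-two stride

int A[N][N + PAD];          // logical N x N matrix in padded storage

void transpose()
{
    // same in-place swap as before; the padding column is never touched
    for ( int i = 0 ; i < N ; i++ )
        for ( int j = 0 ; j < i ; j++ )
            std::swap(A[i][j], A[j][i]);
}

This is the same effect you observed by choosing N = 513, 519, or 530: any row length that avoids the critical stride spreads the column accesses across cache sets. It is worth re-running the benchmark (and cachegrind) after padding to confirm the improvement on your particular CPU.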