The following example code generates an N×N matrix and transposes it SAMPLES times. When N = 512, the average execution time of the transposition operation is 2144 μs (coliru link).
At first glance there is nothing special, right?
Well, here are the results for other values of N:

N = 513 → 1451 μs
N = 519 → 600 μs
N = 530 → 486 μs
N = 540 → 492 μs (finally, theory starts working :)

So why are these simple calculations so different from theory in practice? Does this behavior relate to CPU cache coherence or cache misses? If so, please explain.
#include <algorithm>
#include <iostream>
#include <chrono>

constexpr int N = 512;       // Why is 512 specifically slower? (as of 2016)
constexpr int SAMPLES = 1000;

using us = std::chrono::microseconds;

int A[N][N];

void transpose()
{
    for ( int i = 0 ; i < N ; i++ )
        for ( int j = 0 ; j < i ; j++ )
            std::swap(A[i][j], A[j][i]);
}

int main()
{
    // initialize matrix
    for ( int i = 0 ; i < N ; i++ )
        for ( int j = 0 ; j < N ; j++ )
            A[i][j] = i + j;

    auto t1 = std::chrono::system_clock::now();
    for ( int i = 0 ; i < SAMPLES ; i++ )
        transpose();
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Average for size " << N << ": "
              << std::chrono::duration_cast<us>(t2 - t1).count() / SAMPLES
              << " (us)\n";
}
Another common operation applied to a matrix is the transpose, written Aᵀ in mathematical notation. The transpose is defined for matrices of any size: it flips all elements along the main diagonal, turning rows into columns and columns into rows. For instance, a 4×3 matrix becomes a 3×4 matrix. The transpose is especially useful in applications where the inverse and adjoint of a matrix are needed.
If A is an m×n matrix and Aᵀ is its transpose, multiplying them yields two square matrices: A Aᵀ is m×m and Aᵀ A is n×n. Furthermore, both products are symmetric; the product of any matrix (square or rectangular) with its transpose is always symmetric.
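As a quick illustration of these properties (a minimal sketch, separate from the benchmark above), the following builds a 4×3 matrix, forms its 3×4 transpose, and verifies that A·Aᵀ comes out symmetric:

#include <iostream>

int main()
{
    constexpr int M = 4, K = 3;   // A is 4x3, so A^T is 3x4

    int A[M][K], At[K][M];

    // fill A with arbitrary values
    for ( int i = 0 ; i < M ; i++ )
        for ( int j = 0 ; j < K ; j++ )
            A[i][j] = i * K + j;

    // transpose: rows become columns
    for ( int i = 0 ; i < M ; i++ )
        for ( int j = 0 ; j < K ; j++ )
            At[j][i] = A[i][j];

    // P = A * A^T is an M x M matrix
    int P[M][M] = {};
    for ( int i = 0 ; i < M ; i++ )
        for ( int j = 0 ; j < M ; j++ )
            for ( int k = 0 ; k < K ; k++ )
                P[i][j] += A[i][k] * At[k][j];

    // check symmetry: P[i][j] == P[j][i] for all i, j
    bool symmetric = true;
    for ( int i = 0 ; i < M ; i++ )
        for ( int j = 0 ; j < M ; j++ )
            if ( P[i][j] != P[j][i] )
                symmetric = false;

    std::cout << (symmetric ? "A*A^T is symmetric" : "not symmetric") << "\n";
}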
This is due to cache misses. You can use valgrind --tool=cachegrind (for example, valgrind --tool=cachegrind ./transpose) to see the number of misses. Running your code with N = 512, I got the following output:
Average for size 512: 13052 (us)
==21803==
==21803== I refs: 1,054,721,935
==21803== I1 misses: 1,640
==21803== LLi misses: 1,550
==21803== I1 miss rate: 0.00%
==21803== LLi miss rate: 0.00%
==21803==
==21803== D refs: 524,278,606 (262,185,156 rd + 262,093,450 wr)
==21803== D1 misses: 139,388,226 (139,369,492 rd + 18,734 wr)
==21803== LLd misses: 25,828 ( 7,959 rd + 17,869 wr)
==21803== D1 miss rate: 26.6% ( 53.2% + 0.0% )
==21803== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==21803==
==21803== LL refs: 139,389,866 (139,371,132 rd + 18,734 wr)
==21803== LL misses: 27,378 ( 9,509 rd + 17,869 wr)
==21803== LL miss rate: 0.0% ( 0.0% + 0.0% )
While, using N = 530, I got the following output:
Average for size 530: 13264 (us)
==22783==
==22783== I refs: 1,129,929,859
==22783== I1 misses: 1,640
==22783== LLi misses: 1,550
==22783== I1 miss rate: 0.00%
==22783== LLi miss rate: 0.00%
==22783==
==22783== D refs: 561,773,362 (280,923,156 rd + 280,850,206 wr)
==22783== D1 misses: 32,899,398 ( 32,879,492 rd + 19,906 wr)
==22783== LLd misses: 26,999 ( 7,958 rd + 19,041 wr)
==22783== D1 miss rate: 5.9% ( 11.7% + 0.0% )
==22783== LLd miss rate: 0.0% ( 0.0% + 0.0% )
==22783==
==22783== LL refs: 32,901,038 ( 32,881,132 rd + 19,906 wr)
==22783== LL misses: 28,549 ( 9,508 rd + 19,041 wr)
==22783== LL miss rate: 0.0% ( 0.0% + 0.0% )
As you can see, the number of D1 misses with N = 512 (139,388,226) is roughly four times higher than with N = 530 (32,899,398), and the D1 miss rate drops from 26.6% to 5.9%, even though the 530 run touches more data. The culprit is the power-of-two row stride: with 4-byte ints, a 512-element row is exactly 2048 bytes, so the column accesses A[j][i] in the inner loop keep mapping to the same few sets of the set-associative L1 data cache and evict each other (conflict misses). With N = 530 the stride no longer aligns with the cache geometry, the column walk spreads across many sets, and most accesses hit.
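One common mitigation (a sketch of my own, not part of the question's code) is to pad each row so that consecutive rows no longer map to the same cache sets; the stride between elements of a column stops being a large power of two while the logical matrix stays N×N:

#include <algorithm>

constexpr int N   = 512;
constexpr int PAD = 1;      // one extra column per row breaks the power-of-two stride

int A[N][N + PAD];          // logical N x N matrix in padded storage

void transpose()
{
    // same in-place swap as before; the padding column is never touched
    for ( int i = 0 ; i < N ; i++ )
        for ( int j = 0 ; j < i ; j++ )
            std::swap(A[i][j], A[j][i]);
}

This is the same effect you observed by choosing N = 513, 519, or 530: any row length that avoids the critical stride spreads the column accesses across cache sets. It is worth re-running the benchmark (and cachegrind) after padding to confirm the improvement on your particular CPU.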