Faster way to convert a vector of vectors to a single contiguous vector with opposite storage order

I have a std::vector<std::vector<double>> that I am trying to convert to a single contiguous vector as fast as possible. My vector has a shape of roughly 4000 x 50.

The problem is, sometimes I need my output vector in column-major contiguous order (just concatenating the interior vectors of my 2d input vector), and sometimes I need my output vector in row-major contiguous order, effectively requiring a transpose.
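To make the two layouts concrete, here is a tiny example of my own (not part of the benchmark below), treating each inner vector as one column of a 3 x 2 matrix:

std::vector<std::vector<double>> v {{1, 2, 3}, {4, 5, 6}};  // two columns of height 3
// column-major output (concatenate the columns): 1 2 3 4 5 6
// row-major output (transpose):                  1 4 2 5 3 6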

I have found that a naive for loop is quite fast for conversion to a column-major vector:

auto to_dense_column_major_naive(std::vector<std::vector<double>> const & vec)
    -> std::vector<double>
{
    auto n_col = vec.size();
    auto n_row = vec[0].size();
    std::vector<double> out_vec(n_col * n_row);
    for (size_t i = 0; i < n_col; ++i)
        for (size_t j = 0; j < n_row; ++j)
            out_vec[i * n_row + j] = vec[i][j];
    return out_vec;
}

But obviously a similar approach is very slow for row-wise conversion, because of all of the cache misses. So for row-wise conversion, I thought a blocking strategy to promote cache locality might be my best bet:

auto to_dense_row_major_blocking(std::vector<std::vector<double>> const & vec)
    -> std::vector<double>
{
    auto n_col = vec.size();
    auto n_row = vec[0].size();
    std::vector<double> out_vec(n_col * n_row);
    size_t block_side = 8;  // 8 doubles = 64 bytes, a typical cache line

    for (size_t l = 0; l < n_col; l += block_side) {
        for (size_t k = 0; k < n_row; k += block_side) {
            for (size_t j = l; j < l + block_side && j < n_col; ++j) {
                auto const &column = vec[j];
                for (size_t i = k; i < k + block_side && i < n_row; ++i)
                    out_vec[i * n_col + j] = column[i];
            }
        }
    }
    return out_vec;
}

This is considerably faster than a naive loop for row-major conversion, but still almost an order of magnitude slower than naive column-major looping on my input size.

[Quick Bench chart comparing the four conversion benchmarks]

My question is: is there a faster approach to converting a (column-major) vector of vectors of doubles to a single contiguous row-major vector? I am struggling to reason about what the speed limit of this code should be, and am thus questioning whether I'm missing something obvious. My assumption was that blocking would give a much larger speedup than it actually does.


The chart was generated using Quick Bench (and roughly confirmed with Google Benchmark locally on my machine) with the following code, compiled with Clang 7 at -O3 in C++20 mode:

#include <cstddef>
#include <random>
#include <vector>

#include <benchmark/benchmark.h>

auto to_dense_column_major_naive(std::vector<std::vector<double>> const & vec)
    -> std::vector<double>
{
    auto n_col = vec.size();
    auto n_row = vec[0].size();
    std::vector<double> out_vec(n_col * n_row);
    for (size_t i = 0; i < n_col; ++i)
        for (size_t j = 0; j < n_row; ++j)
            out_vec[i * n_row + j] = vec[i][j];
    return out_vec;
}

auto to_dense_row_major_naive(std::vector<std::vector<double>> const & vec)
    -> std::vector<double>
{
    auto n_col = vec.size();
    auto n_row = vec[0].size();
    std::vector<double> out_vec(n_col * n_row);
    for (size_t i = 0; i < n_col; ++i)
        for (size_t j = 0; j < n_row; ++j)
            out_vec[j * n_col + i] = vec[i][j];
    return out_vec;
}

auto to_dense_row_major_blocking(std::vector<std::vector<double>> const & vec)
    -> std::vector<double>
{
    auto n_col = vec.size();
    auto n_row = vec[0].size();
    std::vector<double> out_vec(n_col * n_row);
    size_t block_side = 8;

    for (size_t l = 0; l < n_col; l += block_side) {
        for (size_t k = 0; k < n_row; k += block_side) {
            for (size_t j = l; j < l + block_side && j < n_col; ++j) {
                auto const &column = vec[j];
                for (size_t i = k; i < k + block_side && i < n_row; ++i)
                    out_vec[i * n_col + j] = column[i];
            }
        }
    }
    return out_vec;
}

auto to_dense_column_major_blocking(std::vector<std::vector<double>> const & vec)
    -> std::vector<double>
{
    auto n_col = vec.size();
    auto n_row = vec[0].size();
    std::vector<double> out_vec(n_col * n_row);
    size_t block_side = 8;

    for (size_t l = 0; l < n_col; l += block_side) {
        for (size_t k = 0; k < n_row; k += block_side) {
            for (size_t j = l; j < l + block_side && j < n_col; ++j) {
                auto const &column = vec[j];
                for (size_t i = k; i < k + block_side && i < n_row; ++i)
                    out_vec[j * n_row + i] = column[i];
            }
        }
    }
    return out_vec;
}

auto make_vecvec() -> std::vector<std::vector<double>>
{
  std::vector<std::vector<double>> vecvec(50, std::vector<double>(4000));
  std::mt19937 mersenne {2019};
  std::uniform_real_distribution<double> dist(-1000, 1000);
  for (auto &vec : vecvec)
    for (auto &val : vec)
      val = dist(mersenne);
  return vecvec;
}

static void NaiveColumnMajor(benchmark::State& state) {
  // Code before the loop is not measured

  auto vecvec = make_vecvec();
  for (auto _ : state) {
    benchmark::DoNotOptimize(to_dense_column_major_naive(vecvec));
  }
}
BENCHMARK(NaiveColumnMajor);

static void NaiveRowMajor(benchmark::State& state) {
  // Code before the loop is not measured

  auto vecvec = make_vecvec();
  for (auto _ : state) {
    benchmark::DoNotOptimize(to_dense_row_major_naive(vecvec));
  }
}
BENCHMARK(NaiveRowMajor);

static void BlockingRowMajor(benchmark::State& state) {
  // Code before the loop is not measured

  auto vecvec = make_vecvec();
  for (auto _ : state) {
    benchmark::DoNotOptimize(to_dense_row_major_blocking(vecvec));
  }
}
BENCHMARK(BlockingRowMajor);

static void BlockingColumnMajor(benchmark::State& state) {
  // Code before the loop is not measured

  auto vecvec = make_vecvec();
  for (auto _ : state) {
    benchmark::DoNotOptimize(to_dense_column_major_blocking(vecvec));
  }
}
BENCHMARK(BlockingColumnMajor);

// Quick Bench supplies its own main; when building locally with Google
// Benchmark, one is needed:
BENCHMARK_MAIN();
Asked Mar 19 '19 by Eric Hansen


1 Answer

First of all, I cringe whenever something is qualified as "obviously". That word is often used to cover up a shortcoming in one's deductions.

But obviously a similar approach is very slow for row-wise conversion, because of all of the cache misses.

I'm not sure which is supposed to be obvious: that the row-wise conversion would be slow, or that it's slow because of cache misses. In either case, I find it not obvious. After all, there are two caching considerations here, aren't there? One for reading and one for writing? Let's look at the code from the reading perspective:

row_major_naive

for (size_t i = 0; i < n_col; ++i)
    for (size_t j = 0; j < n_row; ++j)
        out_vec[j * n_col + i] = vec[i][j];

Successive reads from vec are reads of contiguous memory: vec[i][0] followed by vec[i][1], etc. Very good for caching. So... cache misses? Slow? :) Maybe not so obvious.

Still, there is something to be gleaned from this. The claim is wrong only in its use of "obviously". There are non-locality issues, but they occur on the writing end. (Successive writes are offset by the space for 50 double values, i.e. 400 bytes, several cache lines apart.) And empirical testing confirms the slowness. So maybe the solution is to flip what is considered "obvious"?

row major flipped

for (size_t j = 0; j < n_row; ++j)
    for (size_t i = 0; i < n_col; ++i)
        out_vec[j * n_col + i] = vec[i][j];

All I did here was reverse the loops. Literally swap the order of those two lines of code then adjust the indentation. Now successive reads are potentially all over the place, as they read from different vectors. However, successive writes are now to contiguous blocks of memory. In one sense, we are in the same situation as before. But just like before, one should measure performance before assuming "fast" or "slow".

NaiveColumnMajor: 3.4 seconds
NaiveRowMajor: 7.7 seconds
FlippedRowMajor: 4.2 seconds
BlockingRowMajor: 4.4 seconds
BlockingColumnMajor: 3.9 seconds

Still slower than the naive column major conversion. However, this approach is not only faster than naive row major, but it's also faster than blocking row major. At least on my computer (using gcc -O3 and obviously :P iterating thousands of times). Mileage may vary. I don't know what the fancy profiling tools would say. The point is that sometimes simpler is better.

For funsies I did a test where the dimensions are swapped (changing from 50 vectors of 4000 elements to 4000 vectors of 50 elements). All methods got hurt this way, but "NaiveRowMajor" took the biggest hit. Worth noting is that "flipped row major" fell behind the blocking version. So, as one might expect, the best tool for the job depends on what exactly the job is.

NaiveColumnMajor: 3.7 seconds
NaiveRowMajor: 16 seconds
FlippedRowMajor: 5.6 seconds
BlockingRowMajor: 4.9 seconds
BlockingColumnMajor: 4.5 seconds

(By the way, I also tried the flipping trick on the blocking version. The change was small -- around 0.2 seconds -- and in the opposite direction from flipping the naive version. That is, "flipped blocking" was slower than "blocking" for the question's 50-of-4000 vectors, but faster for my 4000-of-50 variant. Fine tuning might improve the results.)


Update: I did a little more testing with the flipping trick on the blocking version. This version has four loops, so "flipping" is not as straightforward as when there are only two loops. It looks like swapping the order of the outer two loops is bad for performance, while swapping the inner two loops is good. (Initially, I had done both and gotten mixed results.) When I swapped just the inner loops, I measured 3.8 seconds (and 4.1 seconds in the 4000-of-50 scenario), making this the best row-major option in my tests.

row major hybrid

for (size_t l = 0; l < n_col; l += block_side)
    for (size_t i = 0; i < n_row; ++i)
        for (size_t j = l; j < l + block_side && j < n_col; ++j)
            out_vec[i * n_col + j] = vec[j][i];

(After swapping the inner loops, I merged the middle loops.)
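For concreteness, the intermediate step looks like this (inner two loops swapped, k and i not yet merged; together those two loops simply enumerate every row in order, which is why they collapse into the single i loop above):

for (size_t l = 0; l < n_col; l += block_side)
    for (size_t k = 0; k < n_row; k += block_side)
        for (size_t i = k; i < k + block_side && i < n_row; ++i)
            for (size_t j = l; j < l + block_side && j < n_col; ++j)
                out_vec[i * n_col + j] = vec[j][i];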

As for the theory behind this, I would guess that this amounts to trying to write one cache block at a time. Once a block is written, try to re-use the source vectors (the vec[j]) before they get evicted from the cache. After you exhaust those source vectors, move on to a new group of source vectors, again writing full blocks at a time.
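For completeness, here is a minimal self-contained version of this hybrid, with a function name of my choosing and block_side kept at 8 as in the question (tuning it per machine may help):

auto to_dense_row_major_hybrid(std::vector<std::vector<double>> const & vec)
    -> std::vector<double>
{
    auto n_col = vec.size();
    auto n_row = vec[0].size();
    std::vector<double> out_vec(n_col * n_row);
    size_t block_side = 8;

    // Take the source vectors in groups of block_side. For each group, sweep
    // every row once, so writes fill out_vec in contiguous runs of up to
    // block_side doubles while the group's source vectors stay warm in cache.
    for (size_t l = 0; l < n_col; l += block_side)
        for (size_t i = 0; i < n_row; ++i)
            for (size_t j = l; j < l + block_side && j < n_col; ++j)
                out_vec[i * n_col + j] = vec[j][i];
    return out_vec;
}

It drops into the benchmark harness from the question just like the others, e.g. benchmark::DoNotOptimize(to_dense_row_major_hybrid(vecvec));.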

Answered Oct 19 '22 by JaMiT