Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding std::transform and how to beat it

I am trying to understand data-oriented design on a simple, specific problem. Apologies in advance to data-oriented design people, if I am doing something very stupid, but I am having a hard time understanding why and where my reasoning fails.

Assume that I have a simple operation, i.e., float_t result = int_t(lhs) / int_t(rhs). If I keep all of the variables in their corresponding containers, e.g., std::vector<float_t> and std::vector<int_t>, and I use std::transform, I get the correct result. Then, for a specific example where using float_t = float and using int_t = int16_t, I assume that packing these variables inside a struct, on a 64-bit architecture, and collecting them inside a container should yield better performance.

I reason that the struct makes up a 64-bit object, and a single memory access to the struct will give me all the variables I need. On the other hand, when all these variables are collected in different containers, I will need three different memory accesses to get the information needed. Below is how I set up the environment:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  constexpr uint32_t M{100};
  for (auto N : {1000, 10000, 100000}) {
    double t1{0}, t2{0};
    for (uint32_t m = 0; m < M; m++) {
      std::vector<my_float> sinvl(N, 0.0);
      std::vector<my_int> s(N, 3), l(N, 2);
      std::vector<Packed<my_float, my_int>> p1(
          N, Packed<my_float, my_int>(0.0, 3, 2));

      // benchmark unpacked
      auto tstart = high_resolution_clock::now();
      std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                     std::divides<my_float>{}); // 3 different memory accesses
      auto tend = high_resolution_clock::now();
      t1 += duration_cast<microseconds>(tend - tstart).count();

      if (m == M - 1)
        std::cout << "sinvl[0]: " << sinvl[0] << '\n';

      // benchmark packed
      tstart = high_resolution_clock::now();
      for (auto &elem : p1) // 1 memory access
        elem.comp();
      tend = high_resolution_clock::now();
      t2 += duration_cast<microseconds>(tend - tstart).count();

      if (m == M - 1)
        std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
    }
    std::cout << "N = " << N << ", unpacked: " << (t1 / M) << " us.\n";
    std::cout << "N = " << N << ", packed: " << (t2 / M) << " us.\n";
  }
  return 0;
}

The compiled code with g++ -O3 yields, on my machine,

sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 1000, unpacked: 0 us.
N = 1000, packed: 1 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 10000, unpacked: 5.06 us.
N = 10000, packed: 12.97 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 100000, unpacked: 52.31 us.
N = 100000, packed: 124.49 us.

Basically, std::transform beats the packed access by 2.5x. I would appreciate if you helped me understand the behaviour. Is the result due to

  1. me not grasping the data-oriented design principles correctly, or,
  2. some artefact of this very simple example such as the memory locations getting allocated very close to each other and in some way getting optimized very efficiently by the compiler?

Finally, is there a way to beat std::transform in this example, or, is it simply good enough to be a go-to solution? I am not an expert neither in compiler optimizations nor in data-oriented design, and thus, I could not answer this question myself.

Thanks!

EDIT. I have changed the way I test both of the methods as per @bolov's suggestion in the comments.

Now the code looks like:

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  uint32_t N{1000};
  double t{0};

  if (argc == 2)
    N = std::stoul(argv[1]);

#ifndef _M_PACKED
  std::vector<my_float> sinvl(N, 0.0);
  std::vector<my_int> s(N, 3), l(N, 2);

  // benchmark unpacked
  auto tstart = high_resolution_clock::now();
  std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                 std::divides<my_float>{}); // 3 different memory accesses
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();

  std::cout << "sinvl[0]: " << sinvl[0] << '\n';
  std::cout << "N = " << N << ", unpacked: " << t << " us.\n";
#else
  std::vector<Packed<my_float, my_int>> p1(N,
                                           Packed<my_float, my_int>(0.0, 3, 2));
  // benchmark packed
  auto tstart = high_resolution_clock::now();
  for (auto &elem : p1) // 1 memory access
    elem.comp();
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();

  std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
  std::cout << "N = " << N << ", packed: " << t << " us.\n";
#endif

  return 0;
}

with the corresponding shell (fish) script

g++ -Wall -std=c++11 -O3 transform.cpp -o transform_unpacked.out
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_packed.out -D_M_PACKED
for N in 1000 10000 100000
  echo "Testing unpacked for N = $N"
  ./transform_unpacked.out $N
  ./transform_unpacked.out $N
  ./transform_unpacked.out $N
  echo "Testing packed for N = $N"
  ./transform_packed.out $N
  ./transform_packed.out $N
  ./transform_packed.out $N
end

which gives the following:

Testing unpacked for N = 1000
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
Testing packed for N = 1000
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
Testing unpacked for N = 10000
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
Testing packed for N = 10000
p1[0].sinvl: 0.666667
N = 10000, packed: 17 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
Testing unpacked for N = 100000
sinvl[0]: 0.666667
N = 100000, unpacked: 64 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
Testing packed for N = 100000
p1[0].sinvl: 0.666667
N = 100000, packed: 180 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 198 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 177 us.

I hope I have understood the proper testing method, correctly. Still, though, the difference is 2-3 folds.

like image 876
Arda Aytekin Avatar asked Nov 01 '17 10:11

Arda Aytekin


People also ask

Is std :: transform faster?

I timed the difference between both implementations using google benchmark and came to the conclusion that the loop is about 5 times faster than using std::transform.

What does std :: transform do?

std::transform. std::transform applies the given function to a range and stores the result in another range, keeping the original elements order and beginning at d_first . 1) The unary operation unary_op is applied to the range defined by [first1, last1) .

How does transform work in C++?

The transform() function in C++ sequentially applies an operation to the elements of an array(s) and then stores the result in another output array. The transform function is used in two forms: Unary operation: The operation is applied to each element in the input range, and the result is stored in the output array.

Why is std :: array more efficient than std :: vector if you know the number of elements you need?

Difference between std::vector and std::array in C++ Vector is dynamic in nature so, size increases with insertion of elements. As array is fixed size, once initialized can't be resized. Vector occupies more memory. Array is memory efficient data structure.


1 Answers

Here's the compiled loop of the std::transform case:

  400fd0:       f3 41 0f 7e 04 47       movq   xmm0,QWORD PTR [r15+rax*2]
  400fd6:       66 0f 61 c0             punpcklwd xmm0,xmm0
  400fda:       66 0f 72 e0 10          psrad  xmm0,0x10
  400fdf:       0f 5b c0                cvtdq2ps xmm0,xmm0
  400fe2:       f3 0f 7e 0c 43          movq   xmm1,QWORD PTR [rbx+rax*2]
  400fe7:       66 0f 61 c9             punpcklwd xmm1,xmm1
  400feb:       66 0f 72 e1 10          psrad  xmm1,0x10
  400ff0:       0f 5b c9                cvtdq2ps xmm1,xmm1
  400ff3:       0f 5e c1                divps  xmm0,xmm1
  400ff6:       41 0f 11 04 80          movups XMMWORD PTR [r8+rax*4],xmm0
  400ffb:       f3 41 0f 7e 44 47 08    movq   xmm0,QWORD PTR [r15+rax*2+0x8]
  401002:       66 0f 61 c0             punpcklwd xmm0,xmm0
  401006:       66 0f 72 e0 10          psrad  xmm0,0x10
  40100b:       0f 5b c0                cvtdq2ps xmm0,xmm0
  40100e:       f3 0f 7e 4c 43 08       movq   xmm1,QWORD PTR [rbx+rax*2+0x8]
  401014:       66 0f 61 c9             punpcklwd xmm1,xmm1
  401018:       66 0f 72 e1 10          psrad  xmm1,0x10
  40101d:       0f 5b c9                cvtdq2ps xmm1,xmm1
  401020:       0f 5e c1                divps  xmm0,xmm1
  401023:       41 0f 11 44 80 10       movups XMMWORD PTR [r8+rax*4+0x10],xmm0
  401029:       48 83 c0 08             add    rax,0x8
  40102d:       48 83 c1 02             add    rcx,0x2
  401031:       75 9d                   jne    400fd0 <main+0x570>

In each loop cycle, it processes 8 elements (there are two divps instructions, each does 4 divisions).

Here's the other case:

  401190:       f3 0f 6f 42 04          movdqu xmm0,XMMWORD PTR [rdx+0x4]
  401195:       f3 0f 6f 4a 14          movdqu xmm1,XMMWORD PTR [rdx+0x14]
  40119a:       66 0f 70 c9 e8          pshufd xmm1,xmm1,0xe8
  40119f:       66 0f 70 c0 e8          pshufd xmm0,xmm0,0xe8
  4011a4:       f2 0f 70 d0 e8          pshuflw xmm2,xmm0,0xe8
  4011a9:       66 0f 6c c1             punpcklqdq xmm0,xmm1
  4011ad:       66 0f 72 e0 10          psrad  xmm0,0x10
  4011b2:       0f 5b c0                cvtdq2ps xmm0,xmm0
  4011b5:       f2 0f 70 c9 e8          pshuflw xmm1,xmm1,0xe8
  4011ba:       66 0f 62 d1             punpckldq xmm2,xmm1
  4011be:       66 0f 61 ca             punpcklwd xmm1,xmm2
  4011c2:       66 0f 72 e1 10          psrad  xmm1,0x10
  4011c7:       0f 5b c9                cvtdq2ps xmm1,xmm1
  4011ca:       0f 5e c1                divps  xmm0,xmm1
  4011cd:       f3 0f 11 02             movss  DWORD PTR [rdx],xmm0
  4011d1:       0f 28 c8                movaps xmm1,xmm0
  4011d4:       0f c6 c9 e5             shufps xmm1,xmm1,0xe5
  4011d8:       f3 0f 11 4a 08          movss  DWORD PTR [rdx+0x8],xmm1
  4011dd:       0f 28 c8                movaps xmm1,xmm0
  4011e0:       0f 12 c9                movhlps xmm1,xmm1
  4011e3:       f3 0f 11 4a 10          movss  DWORD PTR [rdx+0x10],xmm1
  4011e8:       0f c6 c0 e7             shufps xmm0,xmm0,0xe7
  4011ec:       f3 0f 11 42 18          movss  DWORD PTR [rdx+0x18],xmm0
  4011f1:       48 83 c2 20             add    rdx,0x20
  4011f5:       48 83 c1 fc             add    rcx,0xfffffffffffffffc
  4011f9:       75 95                   jne    401190 <main+0x730>

In each loop cycle, it processes 4 elements (there is one divps instruction).

In the first case, data is in a good format, SIMD instructions can operate on them (almost) without any data-moving, and the result can be written easily (4 results are written with a single instruction).

In the second case, however, this is not the case. The compiler had to emit a lot of data-moving (shuffle) operations, and each result is written with a separate instruction. So the input/output is not in a SIMD friendly format.

I don't have the time to analyse this issue further, but if you just take the fact that both of this snippets has similar size, similar instructions, but the first processes twice as much as elements as the second one, you can have the idea why the second is slower. Sorry about the sloppy explanation.

like image 148
geza Avatar answered Nov 15 '22 10:11

geza