Based on my previous experience benchmarking transform and for_each, they usually perform slightly faster than raw loops, and of course they are safer, so I tried to replace all my raw loops with transform, generate and for_each. Today I compared how fast I can flip booleans using for_each, transform and a raw loop, and I got very surprising results: the raw loop performs 5 times faster than the other two. I was not able to find a good reason for this massive difference. Why does it happen?
#include <benchmark/benchmark.h>
#include <array>
#include <algorithm>
static void ForEach(benchmark::State& state) {
    std::array<bool, sizeof(short) * 8> a;
    std::fill(a.begin(), a.end(), true);
    for (auto _ : state) {
        std::for_each(a.begin(), a.end(), [](auto& arg) { arg = !arg; });
        benchmark::DoNotOptimize(a);
    }
}
BENCHMARK(ForEach);

static void Transform(benchmark::State& state) {
    std::array<bool, sizeof(short) * 8> a;
    std::fill(a.begin(), a.end(), true);
    for (auto _ : state) {
        std::transform(a.begin(), a.end(), a.begin(), [](auto arg) { return !arg; });
        benchmark::DoNotOptimize(a);
    }
}
BENCHMARK(Transform);

static void RawLoop(benchmark::State& state) {
    std::array<bool, sizeof(short) * 8> a;
    std::fill(a.begin(), a.end(), true);
    for (auto _ : state) {
        for (int i = 0; i < a.size(); i++) {
            a[i] = !a[i];
        }
        benchmark::DoNotOptimize(a);
    }
}
BENCHMARK(RawLoop);
Compiled with clang++ 7.0, -O3, using libc++ (LLVM).
To summarize the results: there is no difference between using a raw loop and using std::transform or std::for_each. There IS, however, a difference between using indexing and using iterating, and for the purposes of this particular problem, clang is better at optimizing indexing than it is at optimizing iterating, because indexing gets vectorized. std::transform and std::for_each use iterating, so they end up being slower (when compiled under clang).
What's the difference between indexing and iterating?
- Indexing is when you use an integer to index into an array.
- Iterating is when you increment a pointer from begin() to end().
Let's write the raw loop using indexing and using iterating, and compare the performance of iterating (with a raw loop) to indexing.
// Indexing
for (int i = 0; i < a.size(); i++) {
    a[i] = !a[i];
}

// Iterating, used by std::for_each and std::transform
bool* begin = a.data();
bool* end = begin + a.size();
for (; begin != end; ++begin) {
    *begin = !*begin;
}
The example using indexing is better-optimized, and runs 4-5 times faster when compiled with clang.
In order to demonstrate this, let's add two additional tests, both using a raw loop. One will use an iterator, and the other one will use raw pointers.
static void RawLoopIt(benchmark::State& state) {
    std::array<bool, 16> a;
    std::fill(a.begin(), a.end(), true);
    for (auto _ : state) {
        auto scan = a.begin();
        auto end = a.end();
        for (; scan != end; ++scan) {
            *scan = !*scan;
        }
        benchmark::DoNotOptimize(a);
    }
}
BENCHMARK(RawLoopIt);

static void RawLoopPtr(benchmark::State& state) {
    std::array<bool, 16> a;
    std::fill(a.begin(), a.end(), true);
    for (auto _ : state) {
        bool* scan = a.data();
        bool* end = scan + a.size();
        for (; scan != end; ++scan) {
            *scan = !*scan;
        }
        benchmark::DoNotOptimize(a);
    }
}
BENCHMARK(RawLoopPtr);
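A side note if you want to reproduce this outside Quick-bench (which supplies its own entry point): to build the file locally as shown further below, it also needs the Google Benchmark header and a main, for example:

#include <benchmark/benchmark.h>

// ForEach, Transform, RawLoop, RawLoopIt and RawLoopPtr definitions as above

BENCHMARK_MAIN(); // expands to a main() that runs all registered benchmarks

Alternatively, linking with -lbenchmark_main provides the entry point instead of the macro.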
When using a pointer or an iterator from begin to end, these raw loops are identical in performance to std::for_each and std::transform.
The Clang Quick-bench results show the indexed raw loop running about 4-5 times faster than all the iterator-based versions. This is confirmed by running the benchmark locally with clang:
me@K2SO:~/projects/scratch$ clang++ -O3 bench.cpp -lbenchmark -pthread -o clang-bench
me@K2SO:~/projects/scratch$ ./clang-bench
2019-07-05 16:13:27
Running ./clang-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 8192K (x1)
Load Average: 0.44, 0.55, 0.59
-----------------------------------------------------
Benchmark          Time             CPU   Iterations
-----------------------------------------------------
ForEach         8.32 ns         8.32 ns     83327615
Transform       8.29 ns         8.28 ns     82536410
RawLoop         1.92 ns         1.92 ns    361745495
RawLoopIt       8.31 ns         8.31 ns     81848945
RawLoopPtr      8.28 ns         8.28 ns     82504276
For the purposes of this example, there is no fundamental difference between indexing and iterating. Both of them apply an identical transformation to the array, and the compiler should be able to compile them identically.
Indeed, GCC is able to do this, with all methods running faster than the corresponding version compiled under clang.
The GCC Quick-bench results show all five versions performing identically. Running the benchmark locally with gcc confirms this:
2019-07-05 16:13:35
Running ./gcc-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 8192K (x1)
Load Average: 0.52, 0.57, 0.60
-----------------------------------------------------
Benchmark          Time             CPU   Iterations
-----------------------------------------------------
ForEach         1.43 ns         1.43 ns    484760981
Transform       1.44 ns         1.44 ns    485788409
RawLoop         1.43 ns         1.43 ns    484973417
RawLoopIt       1.44 ns         1.44 ns    482685685
RawLoopPtr      1.44 ns         1.44 ns    483736235
Under the hood, in the fast versions neither iterating nor indexing actually occurs. Instead, gcc (for all five versions) and clang (for the indexed version) vectorize the operation by treating the array as two 64-bit integers and flipping all 16 bools with a bitwise XOR. We can see this reflected in the assembly used to flip the bits:
movabs $0x101010101010101,%rax
nopw %cs:0x0(%rax,%rax,1)
xor %rax,(%rsp)
xor %rax,0x8(%rsp)
sub $0x1,%rbx
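To make the trick concrete, here is a rough scalar C++ equivalent of what that assembly does (an illustration, not the compiler's literal output; it assumes bool is one byte holding 0x00 or 0x01, as it does on these platforms):

#include <array>
#include <cstdint>
#include <cstring>

void flip_all(std::array<bool, 16>& a) {
    const std::uint64_t mask = 0x0101010101010101ULL;  // 0x01 in every byte
    for (std::size_t off = 0; off < a.size(); off += sizeof(std::uint64_t)) {
        std::uint64_t chunk;
        std::memcpy(&chunk, a.data() + off, sizeof(chunk));  // load 8 bools as one 64-bit value
        chunk ^= mask;                                        // flip each byte: 0x00 <-> 0x01
        std::memcpy(a.data() + off, &chunk, sizeof(chunk));   // store them back
    }
}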
Iterating is slower when compiled by clang because for some reason, clang fails to vectorize the operation when iterating is used. This is a defect in clang, and one that applies specifically to this problem. As clang improves, this discrepancy should disappear, and it's not something I would worry about for now.
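If you want to check what clang did with your own build (an extra diagnostic, not part of the benchmark), its optimization remarks can report which loops were vectorized and which were not, for example:

clang++ -O3 -Rpass='.*vectorize.*' -Rpass-missed='.*vectorize.*' -c bench.cpp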
Don't micro-optimize. Let the compiler handle that, and if necessary, test whether gcc or clang produces faster code for your particular use-case. Neither is better for all cases. For example, clang is better at vectorizing some math operations.