Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do arrays of different integer sizes have different performance?

I have following issue:

The write times to an std::array for int8, int16, int32 and int64 are doubling with each size increase. I can understand such behavior for an 8-bit CPU, but not 32/64-bit.

Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?

Here is my test code:

#include <iostream>
#include <array>
#include <chrono>

std::array<std::int8_t, 64 * 1024 * 1024> int8Array;
std::array<std::int16_t, 64 * 1024 * 1024> int16Array;
std::array<std::int32_t, 64 * 1024 * 1024> int32Array;
std::array<std::int64_t, 64 * 1024 * 1024> int64Array;

void PutZero()
{
    auto point1 = std::chrono::high_resolution_clock::now();
    for (auto &v : int8Array) v = 0;
    auto point2 = std::chrono::high_resolution_clock::now();
    for (auto &v : int16Array) v = 0;
    auto point3 = std::chrono::high_resolution_clock::now();
    for (auto &v : int32Array) v = 0;
    auto point4 = std::chrono::high_resolution_clock::now();
    for (auto &v : int64Array) v = 0;
    auto point5 = std::chrono::high_resolution_clock::now();
    std::cout << "Time of processing int8 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point2 - point1)).count() << "us." << std::endl;
    std::cout << "Time of processing int16 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point3 - point2)).count() << "us." << std::endl;
    std::cout << "Time of processing int32 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point4 - point3)).count() << "us." << std::endl;
    std::cout << "Time of processing int64 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point5 - point4)).count() << "us." << std::endl;
}

int main()
{
    PutZero();
    std::cout << std::endl << "Press enter to exit" << std::endl;
    std::cin.get();
    return 0;
}

I compile it under linux with: g++ -o array_issue_1 main.cpp -O3 -std=c++14

and my results are following:

Time of processing int8 array:  9922us.   
Time of processing int16 array: 37717us.   
Time of processing int32 array: 76064us.   
Time of processing int64 array: 146803us.   

If I compile with -O2, then results are 5 times worse for int8!

You can also compile this source in Windows. You will get similar relation between results.

Update #1

When I compile with -O2, then my results are following:

Time of processing int8 array:  60182us.  
Time of processing int16 array: 77807us.  
Time of processing int32 array: 114204us.  
Time of processing int64 array: 186664us.  

I didn't analyze assembler output. My main point is that I would like to write efficient code in C++ and things like that show, that things like std::array can be challenging from performance perspective and somehow counter-intuitive.

like image 530
wrealcon Avatar asked Oct 18 '17 18:10

wrealcon


1 Answers

Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?

It doesn't. But there are 3 different issues with your benchmark that are giving you those results.

  1. You're not pre-faulting the memory. So you're page-faulting the arrays during the benchmark. These page faults along with the OS kernel interaction are a dominant factor in the time.
  2. The compiler with -O3 is completely defeating your benchmark by converting all your loops into memset().
  3. Your benchmark is memory-bound. So you're measuring the speed of your memory instead of the core.

Problem 1: The Test Data is not Prefaulted

Your arrays are declared, but not used before the benchmark. Because of the way the kernel and memory allocation works, they are not mapped into memory yet. It's only when you first touch them does this happen. And when it does, it incurs a very large penalty from the kernel to map the page.

This can be done by touching all the arrays before the benchmark.

No Pre-Faulting: http://coliru.stacked-crooked.com/a/1df1f3f9de420d18

g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array:  28983us.
Time of processing int16 array: 57100us.
Time of processing int32 array: 113361us.
Time of processing int64 array: 224451us.

With Pre-Faulting: http://coliru.stacked-crooked.com/a/7e62b9c7ca19c128

g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array:  6216us.
Time of processing int16 array: 12472us.
Time of processing int32 array: 24961us.
Time of processing int64 array: 49886us.

The times drop by roughly a factor of 4. In other words, your original benchmark was measuring more of the kernel than the actual code.


Problem 2: The Compiler is Defeating the Benchmark

The compiler is recognizing your pattern of writing zeros and is completely replacing all your loops with calls to memset(). So in effect, you're measuring calls to memset() with different sizes.

  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 67108864
  mov edi, OFFSET FLAT:int8Array
  mov r14, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 134217728
  mov edi, OFFSET FLAT:int16Array
  mov r13, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 268435456
  mov edi, OFFSET FLAT:int32Array
  mov r12, rax
  call memset
  call std::chrono::_V2::system_clock::now()
  xor esi, esi
  mov edx, 536870912
  mov edi, OFFSET FLAT:int64Array
  mov rbp, rax
  call memset
  call std::chrono::_V2::system_clock::now()

The optimization that's doing this is -ftree-loop-distribute-patterns. Even if you turn that off, the vectorizer will give you a similar effect.


With -O2, vectorization and pattern recognition are both disabled. So the compiler gives you what you write.

.L4:
  mov BYTE PTR [rax], 0         ;; <<------ 1 byte at a time
  add rax, 1
  cmp rdx, rax
  jne .L4
  call std::chrono::_V2::system_clock::now()
  mov rbp, rax
  mov eax, OFFSET FLAT:int16Array
  lea rdx, [rax+134217728]
.L5:
  xor ecx, ecx
  add rax, 2
  mov WORD PTR [rax-2], cx      ;; <<------ 2 bytes at a time
  cmp rdx, rax
  jne .L5
  call std::chrono::_V2::system_clock::now()
  mov r12, rax
  mov eax, OFFSET FLAT:int32Array
  lea rdx, [rax+268435456]
.L6:
  mov DWORD PTR [rax], 0        ;; <<------ 4 bytes at a time
  add rax, 4
  cmp rax, rdx
  jne .L6
  call std::chrono::_V2::system_clock::now()
  mov r13, rax
  mov eax, OFFSET FLAT:int64Array
  lea rdx, [rax+536870912]
.L7:
  mov QWORD PTR [rax], 0        ;; <<------ 8 bytes at a time
  add rax, 8
  cmp rdx, rax
  jne .L7
  call std::chrono::_V2::system_clock::now()

With -O2: http://coliru.stacked-crooked.com/a/edfdfaaf7ec2882e

g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array:  28414us.
Time of processing int16 array: 22617us.
Time of processing int32 array: 32551us.
Time of processing int64 array: 56591us.

Now it's clear that the smaller word sizes are slower. But you would expect the times to be flat if all the word sizes were the same speed. And the reason they aren't is because of memory bandwidth.


Problem 3: Memory Bandwidth

Because the benchmark (as written) is only writing zeros, it is easily saturating the memory bandwidth for the core/system. So the benchmark becomes affected by how much memory is touched.

To fix that, we need to shrink the dataset so that it fits into cache. To compensate for this, we loop over the same data multiple times.

std::array<std::int8_t, 512> int8Array;
std::array<std::int16_t, 512> int16Array;
std::array<std::int32_t, 512> int32Array;
std::array<std::int64_t, 512> int64Array;

...

auto point1 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();

Now we see timings that are a lot more flat for the different word-sizes:

http://coliru.stacked-crooked.com/a/f534f98f6d840c5c

g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array:  20487us.
Time of processing int16 array: 21965us.
Time of processing int32 array: 32569us.
Time of processing int64 array: 26059us.

The reason why it isn't completely flat is probably because there are numerous other factors involved with the compiler optimizations. You might need to resort to loop-unrolling to get any closer.

like image 158
Mysticial Avatar answered Nov 05 '22 07:11

Mysticial