I have following issue:
The write times to an std::array
for int8
, int16
, int32
and int64
are doubling with each size increase. I can understand such behavior for an 8-bit CPU, but not 32/64-bit.
Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?
Here is my test code:
#include <iostream>
#include <array>
#include <chrono>
std::array<std::int8_t, 64 * 1024 * 1024> int8Array;
std::array<std::int16_t, 64 * 1024 * 1024> int16Array;
std::array<std::int32_t, 64 * 1024 * 1024> int32Array;
std::array<std::int64_t, 64 * 1024 * 1024> int64Array;
void PutZero()
{
auto point1 = std::chrono::high_resolution_clock::now();
for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();
std::cout << "Time of processing int8 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point2 - point1)).count() << "us." << std::endl;
std::cout << "Time of processing int16 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point3 - point2)).count() << "us." << std::endl;
std::cout << "Time of processing int32 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point4 - point3)).count() << "us." << std::endl;
std::cout << "Time of processing int64 array:\t" << (std::chrono::duration_cast<std::chrono::microseconds>(point5 - point4)).count() << "us." << std::endl;
}
int main()
{
PutZero();
std::cout << std::endl << "Press enter to exit" << std::endl;
std::cin.get();
return 0;
}
I compile it under linux with: g++ -o array_issue_1 main.cpp -O3 -std=c++14
and my results are following:
Time of processing int8 array: 9922us.
Time of processing int16 array: 37717us.
Time of processing int32 array: 76064us.
Time of processing int64 array: 146803us.
If I compile with -O2
, then results are 5 times worse for int8
!
You can also compile this source in Windows. You will get similar relation between results.
Update #1
When I compile with -O2, then my results are following:
Time of processing int8 array: 60182us.
Time of processing int16 array: 77807us.
Time of processing int32 array: 114204us.
Time of processing int64 array: 186664us.
I didn't analyze assembler output. My main point is that I would like to write efficient code in C++ and things like that show, that things like std::array
can be challenging from performance perspective and somehow counter-intuitive.
Why does a 32-bit system need 4 times more time to save 32-bit values than to save 8-bit values?
It doesn't. But there are 3 different issues with your benchmark that are giving you those results.
-O3
is completely defeating your benchmark by converting all your loops into memset()
.Problem 1: The Test Data is not Prefaulted
Your arrays are declared, but not used before the benchmark. Because of the way the kernel and memory allocation works, they are not mapped into memory yet. It's only when you first touch them does this happen. And when it does, it incurs a very large penalty from the kernel to map the page.
This can be done by touching all the arrays before the benchmark.
No Pre-Faulting: http://coliru.stacked-crooked.com/a/1df1f3f9de420d18
g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array: 28983us.
Time of processing int16 array: 57100us.
Time of processing int32 array: 113361us.
Time of processing int64 array: 224451us.
With Pre-Faulting: http://coliru.stacked-crooked.com/a/7e62b9c7ca19c128
g++ -O3 -Wall main.cpp && ./a.out
Time of processing int8 array: 6216us.
Time of processing int16 array: 12472us.
Time of processing int32 array: 24961us.
Time of processing int64 array: 49886us.
The times drop by roughly a factor of 4. In other words, your original benchmark was measuring more of the kernel than the actual code.
Problem 2: The Compiler is Defeating the Benchmark
The compiler is recognizing your pattern of writing zeros and is completely replacing all your loops with calls to memset()
. So in effect, you're measuring calls to memset()
with different sizes.
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 67108864
mov edi, OFFSET FLAT:int8Array
mov r14, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 134217728
mov edi, OFFSET FLAT:int16Array
mov r13, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 268435456
mov edi, OFFSET FLAT:int32Array
mov r12, rax
call memset
call std::chrono::_V2::system_clock::now()
xor esi, esi
mov edx, 536870912
mov edi, OFFSET FLAT:int64Array
mov rbp, rax
call memset
call std::chrono::_V2::system_clock::now()
The optimization that's doing this is -ftree-loop-distribute-patterns
. Even if you turn that off, the vectorizer will give you a similar effect.
With -O2
, vectorization and pattern recognition are both disabled. So the compiler gives you what you write.
.L4:
mov BYTE PTR [rax], 0 ;; <<------ 1 byte at a time
add rax, 1
cmp rdx, rax
jne .L4
call std::chrono::_V2::system_clock::now()
mov rbp, rax
mov eax, OFFSET FLAT:int16Array
lea rdx, [rax+134217728]
.L5:
xor ecx, ecx
add rax, 2
mov WORD PTR [rax-2], cx ;; <<------ 2 bytes at a time
cmp rdx, rax
jne .L5
call std::chrono::_V2::system_clock::now()
mov r12, rax
mov eax, OFFSET FLAT:int32Array
lea rdx, [rax+268435456]
.L6:
mov DWORD PTR [rax], 0 ;; <<------ 4 bytes at a time
add rax, 4
cmp rax, rdx
jne .L6
call std::chrono::_V2::system_clock::now()
mov r13, rax
mov eax, OFFSET FLAT:int64Array
lea rdx, [rax+536870912]
.L7:
mov QWORD PTR [rax], 0 ;; <<------ 8 bytes at a time
add rax, 8
cmp rdx, rax
jne .L7
call std::chrono::_V2::system_clock::now()
With -O2
: http://coliru.stacked-crooked.com/a/edfdfaaf7ec2882e
g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array: 28414us.
Time of processing int16 array: 22617us.
Time of processing int32 array: 32551us.
Time of processing int64 array: 56591us.
Now it's clear that the smaller word sizes are slower. But you would expect the times to be flat if all the word sizes were the same speed. And the reason they aren't is because of memory bandwidth.
Problem 3: Memory Bandwidth
Because the benchmark (as written) is only writing zeros, it is easily saturating the memory bandwidth for the core/system. So the benchmark becomes affected by how much memory is touched.
To fix that, we need to shrink the dataset so that it fits into cache. To compensate for this, we loop over the same data multiple times.
std::array<std::int8_t, 512> int8Array;
std::array<std::int16_t, 512> int16Array;
std::array<std::int32_t, 512> int32Array;
std::array<std::int64_t, 512> int64Array;
...
auto point1 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int8Array) v = 0;
auto point2 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int16Array) v = 0;
auto point3 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int32Array) v = 0;
auto point4 = std::chrono::high_resolution_clock::now();
for (int c = 0; c < 64 * 1024; c++) for (auto &v : int64Array) v = 0;
auto point5 = std::chrono::high_resolution_clock::now();
Now we see timings that are a lot more flat for the different word-sizes:
http://coliru.stacked-crooked.com/a/f534f98f6d840c5c
g++ -O2 -Wall main.cpp && ./a.out
Time of processing int8 array: 20487us.
Time of processing int16 array: 21965us.
Time of processing int32 array: 32569us.
Time of processing int64 array: 26059us.
The reason why it isn't completely flat is probably because there are numerous other factors involved with the compiler optimizations. You might need to resort to loop-unrolling to get any closer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With