I decided to benchmark the realization of a swap function for simple types (like int
, or struct
, or a class
that uses only simple types in its fields) with a static
tmp variable in it to prevent memory allocation in each swap call. So I wrote this simple test program:
#include <iostream>
#include <chrono>
#include <utility>
#include <vector>
template<typename T>
void mySwap(T& a, T& b) //Like std::swap - just for tests
{
T tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
template<typename T>
void mySwapStatic(T& a, T& b) //Here with static tmp
{
static T tmp;
tmp = std::move(a);
a = std::move(b);
b = std::move(tmp);
}
class Test1 { //Simple class with some simple types
int foo;
float bar;
char bazz;
};
class Test2 { //Class with std::vector in it
int foo;
float bar;
char bazz;
std::vector<int> bizz;
public:
Test2()
{
bizz = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
}
};
#define Test Test1 //choosing class
const static unsigned int NUM_TESTS = 100000000;
static Test a, b; //making it static to prevent throwing out from code by compiler optimizations
template<typename T, typename F>
auto test(unsigned int numTests, T& a, T& b, const F swapFunction ) //test function
{
std::chrono::system_clock::time_point t1, t2;
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
swapFunction(a, b);
}
t2 = std::chrono::system_clock::now();
return std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}
int main()
{
std::chrono::system_clock::time_point t1, t2;
std::cout << "Test 1. MySwap Result:\t\t" << test(NUM_TESTS, a, b, mySwap<Test>) << " nanoseconds\n"; //caling test function
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwap<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
std::cout << "Test 2. MySwap2 Result:\t\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //This result slightly better then 1. why?!
std::cout << "Test 3. MySwapStatic Result:\t" << test(NUM_TESTS, a, b, mySwapStatic<Test>) << " nanoseconds\n"; //test function with mySwapStatic
t1 = std::chrono::system_clock::now();
for(unsigned int i = 0; i < NUM_TESTS; ++i) {
mySwapStatic<Test>(a, b);
}
t2 = std::chrono::system_clock::now();
std::cout << "Test 4. MySwapStatic2 Result:\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //And again - it's better then 3...
std::cout << "Test 5. std::swap Result:\t" << test(NUM_TESTS, a, b, std::swap<Test>) << " nanoseconds\n"; //calling test function with std::swap for comparsion. Mostly similar to 1...
return 0;
}
Some results with Test
defined as Test1
(g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 called as g++ main.cpp -O3 -std=c++11):
Test 1. MySwap Result: 625,105,480 nanoseconds
Test 2. MySwap2 Result: 528,701,547 nanoseconds
Test 3. MySwapStatic Result: 338,484,180 nanoseconds
Test 4. MySwapStatic2 Result: 228,228,156 nanoseconds
Test 5. std::swap Result: 564,863,184 nanoseconds
My main question: is it good to use this implementation for swapping of simple types? I know that if you use it for swapping types with vectors, for example, then std::swap
is better, and you can see it just by changing the Test
define to Test2
.
Second question: why are the results in test 1, 2, 3, and 4 so different? What am I doing wrong with the test function implementation?
An answer to your second question first : in your test 2 and 4, the compiler is inlining the functions, thus it gives better performances (there is even more on test 4, but I will cover this later).
Overall, it is probably a bad idea to use a static temp variable.
Why ? First, it should be noted that in x86 assembly, there is no instruction to copy from memory to memory. This means that when you swap, there is, not one, but two temporary variables in CPU registers. And these temporary variables MUST be in CPU registers, you can't copy mem to mem, thus a static variable will add a third memory location to transfer to and from.
One problem with your static temp is that it will hinder inlining. Imagine if the variables you swap are already in CPU registers. In this case, the compiler can inline the swapping, and never copy anything to memory, which is much faster. Now, if you forced a static temp, either the compiler removes it (useless), or is forced to add a memory copy. That's what happen in test 4, in which GCC removed all the reads to the static variable. It just pointlessly write updated values to it because you told it do to so. The read removal explains the good performance gain, but it could be even faster.
Your test cases are flawed because they don't show this point.
Now you may ask : Then why my static functions perform better ? I have no idea. (Answer at the end)
I was curious, so I compiled your code with MSVC, and it turns out MSVC is doing it right, and GCC is doing it weird. At O2 optimization levels, MSVC detects that two swaps is a no-op and shortcut them, but even at O1, the non-inlined generated code is faster than all the test case with GCC at O3. (EDIT: Actually, MSVC is not doing it right either, see explanation at the end.)
The assembly generated by MSVC looks indeed better, but when comparing the static and non-static assembly generated by GCC, I don't know why the static performs better.
Anyway, I think even if GCC is generating weird code, the inlining problem should be worth using the std::swap, because with bigger types the additional memory copy could be costly, and smaller types give better inlining.
Here are the assembly produced by all the test cases, if someone have an idea of why the GCC static performs better than the non-static, despite being longer and using more memory moves. EDIT: Answer at the end
GCC non-static (perf 570ms):
00402F90 44 8B 01 mov r8d,dword ptr [rcx] 00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4] 00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8] 00402F9C 4C 8B 0A mov r9,qword ptr [rdx] 00402F9F 4C 89 09 mov qword ptr [rcx],r9 00402FA2 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8] 00402FA7 44 88 49 08 mov byte ptr [rcx+8],r9b 00402FAB 44 89 02 mov dword ptr [rdx],r8d 00402FAE F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0 00402FB3 88 42 08 mov byte ptr [rdx+8],al
GCC static and MSVC static (perf 275ms):
00402F10 48 8B 01 mov rax,qword ptr [rcx] 00402F13 48 89 05 66 11 00 00 mov qword ptr [404080h],rax 00402F1A 0F B6 41 08 movzx eax,byte ptr [rcx+8] 00402F1E 88 05 64 11 00 00 mov byte ptr [404088h],al 00402F24 48 8B 02 mov rax,qword ptr [rdx] 00402F27 48 89 01 mov qword ptr [rcx],rax 00402F2A 0F B6 42 08 movzx eax,byte ptr [rdx+8] 00402F2E 88 41 08 mov byte ptr [rcx+8],al 00402F31 48 8B 05 48 11 00 00 mov rax,qword ptr [404080h] 00402F38 48 89 02 mov qword ptr [rdx],rax 00402F3B 0F B6 05 46 11 00 00 movzx eax,byte ptr [404088h] 00402F42 88 42 08 mov byte ptr [rdx+8],al
MSVC non-static (perf 215ms):
00000 f2 0f 10 02 movsdx xmm0, QWORD PTR [rdx] 00004 f2 0f 10 09 movsdx xmm1, QWORD PTR [rcx] 00008 44 8b 41 08 mov r8d, DWORD PTR [rcx+8] 0000c f2 0f 11 01 movsdx QWORD PTR [rcx], xmm0 00010 8b 42 08 mov eax, DWORD PTR [rdx+8] 00013 89 41 08 mov DWORD PTR [rcx+8], eax 00016 f2 0f 11 0a movsdx QWORD PTR [rdx], xmm1 0001a 44 89 42 08 mov DWORD PTR [rdx+8], r8d
std::swap versions are all identical to the non-static versions.
After having some fun investigating, I found the likely reason of the bad performance of the GCC non-static version. Modern processors have a feature called store-to-load forwarding. This feature kicks in when a memory load match a previous memory store, and shortcut the memory operation to use the value already known. In this case GCC somehow use an asymmetric load/store for parameter A and B. A is copied using 4+4+1 bytes, and B is copied using 8+1 bytes. This means the 8 first bytes of the class won't be matched by the store-to-load forwarding, losing a precious CPU optimization. To check this, I manually replaced the 8+1 copy by a 4+4+1 copy, and the performance went up as expected (code below). In the end, GCC is at fault for not considering this.
GCC patched code, longer but taking advantage of store forwarding (perf 220ms) :
00402F90 44 8B 01 mov r8d,dword ptr [rcx] 00402F93 F3 0F 10 41 04 movss xmm0,dword ptr [rcx+4] 00402F98 0F B6 41 08 movzx eax,byte ptr [rcx+8]00402F9C 4C 8B 0A mov r9,qword ptr [rdx]00402F9F 4C 89 09 mov qword ptr [rcx],r900402F9C 44 8B 0A mov r9d,dword ptr [rdx] 00402F9F 44 89 09 mov dword ptr [rcx],r9d 00402FA2 44 8B 4A 04 mov r9d,dword ptr [rdx+4] 00402FA6 44 89 49 04 mov dword ptr [rcx+4],r9d 00402FAA 44 0F B6 4A 08 movzx r9d,byte ptr [rdx+8] 00402FAF 44 88 49 08 mov byte ptr [rcx+8],r9b 00402FB3 44 89 02 mov dword ptr [rdx],r8d 00402FB6 F3 0F 11 42 04 movss dword ptr [rdx+4],xmm0 00402FBB 88 42 08 mov byte ptr [rdx+8],al
Actually, this copy instructions (symmetric 4+4+1) is the right way to do it. In these test we are only doing copies, in which case MSVC version is the best without doubt. The problem is that in a real case, the class member will be accessed individually, thus generating 4 bytes read/write. MSVC 8 bytes batch copy (also generated by GCC for one argument) will prevent future store forwarding for individual members. A new test I did with member operation beside copies show that the patched 4+4+1 versions indeed outperform all the others. And by a factor of nearly x2. Sadly, no modern compiler is generating this code.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With