Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reimplementing std::swap() with static tmp variable for simple types C++

I decided to benchmark the realization of a swap function for simple types (like int, or struct, or a class that uses only simple types in its fields) with a static tmp variable in it to prevent memory allocation in each swap call. So I wrote this simple test program:

#include <iostream>
#include <chrono>
#include <utility>
#include <vector>


template<typename T>
void mySwap(T& a, T& b)     //Like std::swap - just for tests
{
    T tmp = std::move(a);
    a = std::move(b);
    b = std::move(tmp);
}

template<typename T>
void mySwapStatic(T& a, T& b)   //Here with static tmp
{
    static T tmp;
    tmp = std::move(a);
    a = std::move(b);
    b = std::move(tmp);
}

class Test1 {       //Simple class with some simple types
    int foo;
    float bar;
    char bazz;
};

class Test2 {       //Class with std::vector in it
    int foo;
    float bar;
    char bazz;
    std::vector<int> bizz;
public:
    Test2()
    {
        bizz = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    }
};

#define Test Test1      //choosing class

const static unsigned int NUM_TESTS = 100000000;
static Test a, b;   //making it static to prevent throwing out from code by compiler optimizations

template<typename T, typename F>
auto test(unsigned int numTests, T& a, T& b, const F swapFunction )     //test function
{
    std::chrono::system_clock::time_point t1, t2;
    t1 = std::chrono::system_clock::now();
    for(unsigned int i = 0; i < NUM_TESTS; ++i)    {
        swapFunction(a, b);
    }
    t2 = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
}

int main()
{
    std::chrono::system_clock::time_point t1, t2;
    std::cout << "Test 1. MySwap Result:\t\t" << test(NUM_TESTS, a, b, mySwap<Test>) << " nanoseconds\n";   //caling test function
    t1 = std::chrono::system_clock::now();
    for(unsigned int i = 0; i < NUM_TESTS; ++i)    {
        mySwap<Test>(a, b);
    }
    t2 = std::chrono::system_clock::now();
    std::cout << "Test 2. MySwap2 Result:\t\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //This result slightly better then 1. why?!
    std::cout << "Test 3. MySwapStatic Result:\t" << test(NUM_TESTS, a, b, mySwapStatic<Test>) << " nanoseconds\n"; //test function with mySwapStatic
    t1 = std::chrono::system_clock::now();
    for(unsigned int i = 0; i < NUM_TESTS; ++i)    {
        mySwapStatic<Test>(a, b);
    }
    t2 = std::chrono::system_clock::now();
    std::cout << "Test 4. MySwapStatic2 Result:\t" << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " nanoseconds\n"; //And again - it's better then 3...
    std::cout << "Test 5. std::swap Result:\t" << test(NUM_TESTS, a, b, std::swap<Test>) << " nanoseconds\n";   //calling test function with std::swap for comparsion. Mostly similar to 1...
    return 0;
}

Some results with Test defined as Test1 (g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2 called as g++ main.cpp -O3 -std=c++11):

Test 1. MySwap Result: 625,105,480 nanoseconds

Test 2. MySwap2 Result: 528,701,547 nanoseconds

Test 3. MySwapStatic Result: 338,484,180 nanoseconds

Test 4. MySwapStatic2 Result: 228,228,156 nanoseconds

Test 5. std::swap Result: 564,863,184 nanoseconds

My main question: is it good to use this implementation for swapping of simple types? I know that if you use it for swapping types with vectors, for example, then std::swap is better, and you can see it just by changing the Test define to Test2.

Second question: why are the results in test 1, 2, 3, and 4 so different? What am I doing wrong with the test function implementation?

like image 850
dFionov Avatar asked Feb 11 '23 13:02

dFionov


1 Answers

An answer to your second question first : in your test 2 and 4, the compiler is inlining the functions, thus it gives better performances (there is even more on test 4, but I will cover this later).

Overall, it is probably a bad idea to use a static temp variable.

Why ? First, it should be noted that in x86 assembly, there is no instruction to copy from memory to memory. This means that when you swap, there is, not one, but two temporary variables in CPU registers. And these temporary variables MUST be in CPU registers, you can't copy mem to mem, thus a static variable will add a third memory location to transfer to and from.

One problem with your static temp is that it will hinder inlining. Imagine if the variables you swap are already in CPU registers. In this case, the compiler can inline the swapping, and never copy anything to memory, which is much faster. Now, if you forced a static temp, either the compiler removes it (useless), or is forced to add a memory copy. That's what happen in test 4, in which GCC removed all the reads to the static variable. It just pointlessly write updated values to it because you told it do to so. The read removal explains the good performance gain, but it could be even faster.

Your test cases are flawed because they don't show this point.

Now you may ask : Then why my static functions perform better ? I have no idea. (Answer at the end)

I was curious, so I compiled your code with MSVC, and it turns out MSVC is doing it right, and GCC is doing it weird. At O2 optimization levels, MSVC detects that two swaps is a no-op and shortcut them, but even at O1, the non-inlined generated code is faster than all the test case with GCC at O3. (EDIT: Actually, MSVC is not doing it right either, see explanation at the end.)

The assembly generated by MSVC looks indeed better, but when comparing the static and non-static assembly generated by GCC, I don't know why the static performs better.

Anyway, I think even if GCC is generating weird code, the inlining problem should be worth using the std::swap, because with bigger types the additional memory copy could be costly, and smaller types give better inlining.


Here are the assembly produced by all the test cases, if someone have an idea of why the GCC static performs better than the non-static, despite being longer and using more memory moves. EDIT: Answer at the end

GCC non-static (perf 570ms):

    00402F90 44 8B 01             mov         r8d,dword ptr [rcx]
    00402F93 F3 0F 10 41 04       movss       xmm0,dword ptr [rcx+4]
    00402F98 0F B6 41 08          movzx       eax,byte ptr [rcx+8] 
    00402F9C 4C 8B 0A             mov         r9,qword ptr [rdx]
    00402F9F 4C 89 09             mov         qword ptr [rcx],r9
    00402FA2 44 0F B6 4A 08       movzx       r9d,byte ptr [rdx+8]
    00402FA7 44 88 49 08          mov         byte ptr [rcx+8],r9b
    00402FAB 44 89 02             mov         dword ptr [rdx],r8d 
    00402FAE F3 0F 11 42 04       movss       dword ptr [rdx+4],xmm0
    00402FB3 88 42 08             mov         byte ptr [rdx+8],al

GCC static and MSVC static (perf 275ms):

    00402F10 48 8B 01             mov         rax,qword ptr [rcx]  
    00402F13 48 89 05 66 11 00 00 mov         qword ptr [404080h],rax  
    00402F1A 0F B6 41 08          movzx       eax,byte ptr [rcx+8]  
    00402F1E 88 05 64 11 00 00    mov         byte ptr [404088h],al  
    00402F24 48 8B 02             mov         rax,qword ptr [rdx]  
    00402F27 48 89 01             mov         qword ptr [rcx],rax  
    00402F2A 0F B6 42 08          movzx       eax,byte ptr [rdx+8]  
    00402F2E 88 41 08             mov         byte ptr [rcx+8],al  
    00402F31 48 8B 05 48 11 00 00 mov         rax,qword ptr [404080h]  
    00402F38 48 89 02             mov         qword ptr [rdx],rax  
    00402F3B 0F B6 05 46 11 00 00 movzx       eax,byte ptr [404088h]  
    00402F42 88 42 08             mov         byte ptr [rdx+8],al  

MSVC non-static (perf 215ms):

    00000   f2 0f 10 02  movsdx  xmm0, QWORD PTR [rdx]
    00004   f2 0f 10 09  movsdx  xmm1, QWORD PTR [rcx]
    00008   44 8b 41 08  mov     r8d, DWORD PTR [rcx+8]
    0000c   f2 0f 11 01  movsdx  QWORD PTR [rcx], xmm0
    00010   8b 42 08     mov     eax, DWORD PTR [rdx+8]
    00013   89 41 08     mov     DWORD PTR [rcx+8], eax
    00016   f2 0f 11 0a  movsdx  QWORD PTR [rdx], xmm1
    0001a   44 89 42 08  mov     DWORD PTR [rdx+8], r8d

std::swap versions are all identical to the non-static versions.


After having some fun investigating, I found the likely reason of the bad performance of the GCC non-static version. Modern processors have a feature called store-to-load forwarding. This feature kicks in when a memory load match a previous memory store, and shortcut the memory operation to use the value already known. In this case GCC somehow use an asymmetric load/store for parameter A and B. A is copied using 4+4+1 bytes, and B is copied using 8+1 bytes. This means the 8 first bytes of the class won't be matched by the store-to-load forwarding, losing a precious CPU optimization. To check this, I manually replaced the 8+1 copy by a 4+4+1 copy, and the performance went up as expected (code below). In the end, GCC is at fault for not considering this.

GCC patched code, longer but taking advantage of store forwarding (perf 220ms) :

00402F90 44 8B 01             mov         r8d,dword ptr [rcx]  
00402F93 F3 0F 10 41 04       movss       xmm0,dword ptr [rcx+4]  
00402F98 0F B6 41 08          movzx       eax,byte ptr [rcx+8]

00402F9C 4C 8B 0A             mov         r9,qword ptr [rdx]
00402F9F 4C 89 09             mov         qword ptr [rcx],r9
00402F9C 44 8B 0A             mov         r9d,dword ptr [rdx]
00402F9F 44 89 09             mov         dword ptr [rcx],r9d
00402FA2 44 8B 4A 04          mov         r9d,dword ptr [rdx+4]
00402FA6 44 89 49 04          mov         dword ptr [rcx+4],r9d

00402FAA 44 0F B6 4A 08       movzx       r9d,byte ptr [rdx+8]  
00402FAF 44 88 49 08          mov         byte ptr [rcx+8],r9b  
00402FB3 44 89 02             mov         dword ptr [rdx],r8d  
00402FB6 F3 0F 11 42 04       movss       dword ptr [rdx+4],xmm0  
00402FBB 88 42 08             mov         byte ptr [rdx+8],al

Actually, this copy instructions (symmetric 4+4+1) is the right way to do it. In these test we are only doing copies, in which case MSVC version is the best without doubt. The problem is that in a real case, the class member will be accessed individually, thus generating 4 bytes read/write. MSVC 8 bytes batch copy (also generated by GCC for one argument) will prevent future store forwarding for individual members. A new test I did with member operation beside copies show that the patched 4+4+1 versions indeed outperform all the others. And by a factor of nearly x2. Sadly, no modern compiler is generating this code.

like image 143
ElderBug Avatar answered Feb 13 '23 04:02

ElderBug