I was experimenting with different integer types in a Visual Studio project on Windows, using the simple exchange sort algorithm below. The processor is Intel. The code was compiled in Release x64. The optimization setting is "Maximize Speed (/O2)". The command line corresponding to these compilation settings is:
/permissive- /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\SpeedTestForIntegerTypes.pch" /diagnostics:classic
The code itself:
#include <ctime>
#include <vector>
#include <iostream>

void sort(int N, int A[], int WorkArray[]) // exchange sort
{
    int i, j, index, val_min;
    for (j = 0; j < N; j++)
    {
        val_min = 500000;
        for (i = j; i < N; i++)
        {
            if (A[i] < val_min)
            {
                val_min = A[i];
                index = i;
            }
        }
        WorkArray[j] = A[j];
        A[j] = val_min;
        A[index] = WorkArray[j];
    }
}

int main()
{
    std::vector<int> A(400000), WorkArray(400000);
    for (size_t k = 0; k < 400000; k++)
        A[k] = 400000 - (k + 1);

    clock_t begin = clock();
    sort(400000, &A[0], &WorkArray[0]);
    clock_t end = clock();

    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Sort time: " << sortTime << std::endl;
    return 0;
}
The WorkArray is only needed to save the vector before sorting. The point is, this sort took me 22.3 seconds to complete. The interesting part is that if I change the type int to size_t for the arrays A and WorkArray (both in std::vector and in the argument list of the function sort), as well as for val_min, the time increases to 67.4 seconds! That is threefold slower! The new code is below:
#include <ctime>
#include <vector>
#include <iostream>

void sort(int N, size_t A[], size_t WorkArray[]) // exchange sort
{
    int i, j, index;
    size_t val_min;
    for (j = 0; j < N; j++)
    {
        val_min = 500000U;
        for (i = j; i < N; i++)
        {
            if (A[i] < val_min)
            {
                val_min = A[i];
                index = i;
            }
        }
        WorkArray[j] = A[j];
        A[j] = val_min;
        A[index] = WorkArray[j];
    }
}

int main()
{
    std::vector<size_t> A(400000), WorkArray(400000);
    for (size_t k = 0; k < 400000; k++)
        A[k] = 400000 - (k + 1);

    clock_t begin = clock();
    sort(400000, &A[0], &WorkArray[0]);
    clock_t end = clock();

    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Sort time: " << sortTime << std::endl;
    return 0;
}
Note that I still keep type int for the local variables i, j, index and the parameter N, so the only two arithmetic operations, i++ and j++, should take the same amount of time in both cases. Therefore, this slowdown has to do with other reasons. Is it related to memory alignment, register sizes, or something else?
But the most outrageous part was when I changed int to unsigned int. Both unsigned int and int occupy the same number of bytes, namely 4 (sizeof showed that). Yet the runtime for unsigned int was 65.8 s! While the first outcome was somewhat acceptable, the second one totally confuses me. Why is there such a significant difference in the time it takes to run such a simple algorithm that does not even involve sign checks?
Thanks to everyone addressing both of these questions. Where can I start reading more about these hardware-level optimization peculiarities? I don't care about the sorting algorithm itself; it is here only to illustrate the problem.
UPDATE: once again, I stress that I use int for the array indices in all three cases.
One downside is that on 64-bit Windows size_t is unsigned long long, because long is only 32 bits there (even in 64-bit mode). If we consider the standard, both int and size_t are only required to be at least 16 bits wide; on a typical 64-bit system size_t will be 64 bits while unsigned int will be 32 bits, so we cannot use them interchangeably. One recommendation in the standard is that size_t be at most as wide as an unsigned long where possible. When writing C code you should always use size_t whenever dealing with memory sizes and ranges. The int type, on the other hand, is basically meant to be the natural (signed) integer size that the host machine can use to perform integer arithmetic most efficiently.
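As a quick sanity check of the sizes mentioned above, something like the following can be compiled and run on the target toolchain (a minimal sketch; the exact values depend on the platform and ABI):

#include <cstddef>
#include <iostream>

int main()
{
    // On 64-bit Windows (MSVC x64) this is expected to print 4, 4, 4, 8;
    // on a typical 64-bit Linux (LP64) system, long is 8 bytes instead.
    std::cout << "int:          " << sizeof(int) << '\n'
              << "unsigned int: " << sizeof(unsigned int) << '\n'
              << "long:         " << sizeof(long) << '\n'
              << "size_t:       " << sizeof(std::size_t) << '\n';
    return 0;
}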
Related question: In C, why is "signed int" faster than "unsigned int"? True, I know this has been asked and answered multiple times on this website (links below), but most answers said that there is no difference. I then wrote this code and accidentally found a significant performance difference.
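For background on that linked question (this is not the specific cause measured in the answers below), the usual illustration is that signed overflow is undefined behaviour, so the optimizer may assume a signed counter never wraps, while an unsigned counter must wrap modulo 2^32. A minimal sketch of the kind of loop where that distinction can matter on a 64-bit target (hypothetical functions, not from the question's code):

// With a signed 32-bit counter, the compiler may assume i never overflows and
// promote it to a pointer-sized induction variable once; with an unsigned
// 32-bit counter, wrap-around at 2^32 is well defined and must be preserved,
// which can block that transformation.
long long sum_signed(const long long* p, int n)
{
    long long s = 0;
    for (int i = 0; i < n; ++i)
        s += p[i];
    return s;
}

long long sum_unsigned(const long long* p, unsigned n)
{
    long long s = 0;
    for (unsigned i = 0; i < n; ++i)
        s += p[i];
    return s;
}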
Inspecting the generated assembly for all 3 variants (int, unsigned, size_t), the big difference is that in the int case the loop in the sort function is unrolled and uses SSE instructions (working on 8 ints at a time), while in the other 2 cases it does neither. Interestingly enough, the sort function is called in the int case, while it is inlined into main in the other two (likely because the loop unrolling increases the size of the function).
I'm compiling from the command line using cl /nologo /W4 /MD /EHsc /Zi /Ox, using dumpbin to get the disassembly, with toolset Microsoft (R) C/C++ Optimizing Compiler Version 19.12.25830.2 for x64. I get execution times of around 30 seconds for int and 100 seconds for the other two.
I tried this code in VS2017 and was able to reproduce the difference. I then modified the code as follows so that the times become almost the same. The cause seems to be the implicit casting involved in the array indexing.
#include <ctime>
#include <vector>
#include <iostream>
using namespace std;

// exchange sort
template<typename elem_t, typename index_t>
void sort(index_t size, elem_t* a, elem_t* b)
{
    index_t index = 0, i, j;
    elem_t min;

    for (j = 0; j < size; j++)
    {
        min = 500000;
        for (i = j; i < size; i++)
        {
            if (a[i] < min)
            {
                min = a[i];
                index = i;
            }
        }
        b[j] = a[j];
        a[j] = min;
        a[index] = b[j];
    }
}

template<typename elem_t, typename index_t, index_t size>
void test() {
    //vector<elem_t> a(size);
    //vector<elem_t> b(size);
    elem_t a[size];
    elem_t b[size];

    for (index_t k = 0; k < size; k++)
        a[k] = (elem_t)(size - (k + 1));

    clock_t begin = clock();
    sort(size, &a[0], &b[0]);
    clock_t end = clock();

    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    cout << "Sort time: " << sortTime << endl;
}

int main()
{
    const int size = 40000;

    cout << "<size_t, int>" << endl;
    test<size_t, int, size>();
    cout << endl;

    cout << "<size_t, size_t>" << endl;
    test<size_t, size_t, size>();
    cout << endl;

    cout << "<int, int>" << endl;
    test<int, int, size>();
    cout << endl;

    cout << "<int, size_t>" << endl;
    test<int, size_t, size>();
    cout << endl;

    cout << "<uint, int>" << endl;
    test<unsigned int, int, size>();
    cout << endl;

    cout << "<uint, size_t>" << endl;
    test<unsigned int, size_t, size>();
    cout << endl;

    cout << "<uint, uint>" << endl;
    test<unsigned int, unsigned int, size>();
    cout << endl;
}
Personally, I do not like implicit casting. For troubleshooting this kind of problem, increase the warning level to the maximum, resolve all the warnings, and then convert the code to a generic (templated) form, as above. This will help you identify the problem.
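To illustrate that advice: the kind of implicit index conversion being discussed is exactly what the compiler flags once warnings are turned up. A minimal sketch (a hypothetical function, not taken from the question's code):

#include <cstddef>

// Compiled for x64 with /W4, the assignment marked below is expected to draw
// warning C4267 ("conversion from 'size_t' to 'int', possible loss of data").
// Making the conversion explicit, or using size_t for the index throughout as
// the templated code above does, resolves it.
int last_negative_index(const int a[], std::size_t n)
{
    int index = -1;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (a[i] < 0)
            index = i;   // C4267: size_t narrowed to int
    }
    return index;
}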
The results of this code come from the combination of several factors:
signed vs unsigned: In C, why is "signed int" faster than "unsigned int"?
type size (int32 vs int64)
array index assembly code
vc++ optimization: /O2 (Maximize Speed)