I was experimenting with different integer types in a Visual Studio project on Windows, using the simple exchange sort algorithm below. The processor is Intel. The code was compiled in Release x64. The optimization setting is "Maximize Speed (/O2)". The command line corresponding to these compilation settings is:
/permissive- /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"x64\Release\vc141.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Fp"x64\Release\SpeedTestForIntegerTypes.pch" /diagnostics:classic
The code itself:
#include <ctime>
#include <vector>
#include <iostream>

void sort(int N, int A[], int WorkArray[]) // exchange sort
{
    int i, j, index, val_min;
    for (j = 0; j < N; j++)
    {
        val_min = 500000;
        for (i = j; i < N; i++)
        {
            if (A[i] < val_min)
            {
                val_min = A[i];
                index = i;
            }
        }
        WorkArray[j] = A[j];
        A[j] = val_min;
        A[index] = WorkArray[j];
    }
}

int main()
{
    std::vector<int> A(400000), WorkArray(400000);
    for (size_t k = 0; k < 400000; k++)
        A[k] = 400000 - (k + 1);

    clock_t begin = clock();
    sort(400000, &A[0], &WorkArray[0]);
    clock_t end = clock();

    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Sort time: " << sortTime << std::endl;
    return 0;
}
The WorkArray is only needed to save the vector before sorting. The point is, this sort took me 22.3 seconds to complete. The interesting part is that if I change the type int to size_t for the arrays A and WorkArray (both in std::vector and in the argument list of the function sort), as well as for val_min, the time increases to 67.4 seconds! That is threefold slower! The new code is below:
#include <ctime>
#include <vector>
#include <iostream>

void sort(int N, size_t A[], size_t WorkArray[]) // exchange sort
{
    int i, j, index;
    size_t val_min;
    for (j = 0; j < N; j++)
    {
        val_min = 500000U;
        for (i = j; i < N; i++)
        {
            if (A[i] < val_min)
            {
                val_min = A[i];
                index = i;
            }
        }
        WorkArray[j] = A[j];
        A[j] = val_min;
        A[index] = WorkArray[j];
    }
}

int main()
{
    std::vector<size_t> A(400000), WorkArray(400000);
    for (size_t k = 0; k < 400000; k++)
        A[k] = 400000 - (k + 1);

    clock_t begin = clock();
    sort(400000, &A[0], &WorkArray[0]);
    clock_t end = clock();

    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Sort time: " << sortTime << std::endl;
    return 0;
}
Note that I still keep type int for the local variables i, j, index and the parameter N, so the only two arithmetic operations, i++ and j++, should take the same amount of time in both cases. Therefore, this slowdown has to do with other reasons. Is it related to memory alignment, register sizes, or something else?
But the most outrageous part was when I changed int to unsigned int. Both unsigned int and int occupy the same number of bytes, namely 4 (sizeof showed that). Yet the runtime for unsigned int was 65.8 s! While the first outcome was somewhat acceptable, the second one totally confuses me. Why is there such a significant difference in the time it takes to run such a simple algorithm that does not even involve sign checks?
Thanks to everyone addressing both of these questions. Where can I start reading more about these hardware-level optimization peculiarities? I don't care about the sorting algorithm itself; it is here only to illustrate the problem.
UPDATE: once again, I stress that I use int for the array indices in all three cases.
One downside is that on 64-bit Windows size_t is unsigned long long, because long is only 32 bits there (even in 64-bit mode). If we consider the standard, both int and size_t are only required to be at least 16 bits wide; on a typical 64-bit system size_t will be 64 bits while unsigned int will be 32 bits, so we cannot use them interchangeably. One recommendation in the standard is that size_t be at most as wide as an unsigned long where possible. When writing C code you should always use size_t whenever dealing with memory sizes and ranges. The int type, on the other hand, is basically meant to be the natural (signed) integer size that the host machine can use to perform integer arithmetic most efficiently.
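As a quick sanity check of the sizes mentioned above, something like the following can be compiled and run on the target toolchain (a minimal sketch; the exact values depend on the platform and ABI):

#include <cstddef>
#include <iostream>

int main()
{
    // On 64-bit Windows (MSVC x64) this is expected to print 4, 4, 4, 8;
    // on a typical 64-bit Linux (LP64) system, long is 8 bytes instead.
    std::cout << "int:          " << sizeof(int) << '\n'
              << "unsigned int: " << sizeof(unsigned int) << '\n'
              << "long:         " << sizeof(long) << '\n'
              << "size_t:       " << sizeof(std::size_t) << '\n';
    return 0;
}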
Related question: In C, why is "signed int" faster than "unsigned int"? True, I know this has been asked and answered multiple times on this website (links below), but most answers said that there is no difference. I then wrote this code and accidentally found a significant performance difference.
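For background on that linked question (this is not the specific cause measured in the answers below), the usual illustration is that signed overflow is undefined behaviour, so the optimizer may assume a signed counter never wraps, while an unsigned counter must wrap modulo 2^32. A minimal sketch of the kind of loop where that distinction can matter on a 64-bit target (hypothetical functions, not from the question's code):

// With a signed 32-bit counter, the compiler may assume i never overflows and
// promote it to a pointer-sized induction variable once; with an unsigned
// 32-bit counter, wrap-around at 2^32 is well defined and must be preserved,
// which can block that transformation.
long long sum_signed(const long long* p, int n)
{
    long long s = 0;
    for (int i = 0; i < n; ++i)
        s += p[i];
    return s;
}

long long sum_unsigned(const long long* p, unsigned n)
{
    long long s = 0;
    for (unsigned i = 0; i < n; ++i)
        s += p[i];
    return s;
}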
Inspecting the generated assembly for all 3 variants (int, unsigned, size_t), the big difference is that in the int case the loop in the sort function is unrolled and uses SSE instructions (working on 8 ints at a time), while in the other 2 cases it does neither. Interestingly enough, the sort function is called in the int case, while it is inlined into main in the other two (likely because the loop unrolling increases the size of the function).
I'm compiling from the command line using cl /nologo /W4 /MD /EHsc /Zi /Ox, using dumpbin to get the disassembly, with toolset Microsoft (R) C/C++ Optimizing Compiler Version 19.12.25830.2 for x64. I get execution times of around 30 seconds for int and 100 seconds for the other two.
I tried this code in VS2017 and was able to reproduce the difference. I then modified the code as follows so that the times become almost the same. The cause seems to be the implicit casting involved in the array indexing.
#include <ctime>
#include <vector>
#include <iostream>
using namespace std;

// exchange sort
template<typename elem_t, typename index_t>
void sort(index_t size, elem_t* a, elem_t* b)
{
    index_t index = 0, i, j;
    elem_t min;

    for (j = 0; j < size; j++)
    {
        min = 500000;
        for (i = j; i < size; i++)
        {
            if (a[i] < min)
            {
                min = a[i];
                index = i;
            }
        }
        b[j] = a[j];
        a[j] = min;
        a[index] = b[j];
    }
}

template<typename elem_t, typename index_t, index_t size>
void test() {
    //vector<elem_t> a(size);
    //vector<elem_t> b(size);
    elem_t a[size];
    elem_t b[size];

    for (index_t k = 0; k < size; k++)
        a[k] = (elem_t)(size - (k + 1));

    clock_t begin = clock();
    sort(size, &a[0], &b[0]);
    clock_t end = clock();

    double sortTime = double(end - begin) / CLOCKS_PER_SEC;
    cout << "Sort time: " << sortTime << endl;
}

int main()
{
    const int size = 40000;

    cout << "<size_t, int>" << endl;
    test<size_t, int, size>();
    cout << endl;

    cout << "<size_t, size_t>" << endl;
    test<size_t, size_t, size>();
    cout << endl;

    cout << "<int, int>" << endl;
    test<int, int, size>();
    cout << endl;

    cout << "<int, size_t>" << endl;
    test<int, size_t, size>();
    cout << endl;

    cout << "<uint, int>" << endl;
    test<unsigned int, int, size>();
    cout << endl;

    cout << "<uint, size_t>" << endl;
    test<unsigned int, size_t, size>();
    cout << endl;

    cout << "<uint, uint>" << endl;
    test<unsigned int, unsigned int, size>();
    cout << endl;
}
Personally, I do not like implicit casting. For troubleshooting this kind of problem, increase the warning level to the maximum, resolve all the warnings, and then convert the code to a generic (templated) form, as above. This will help you identify the problem.
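To illustrate that advice: the kind of implicit index conversion being discussed is exactly what the compiler flags once warnings are turned up. A minimal sketch (a hypothetical function, not taken from the question's code):

#include <cstddef>

// Compiled for x64 with /W4, the assignment marked below is expected to draw
// warning C4267 ("conversion from 'size_t' to 'int', possible loss of data").
// Making the conversion explicit, or using size_t for the index throughout as
// the templated code above does, resolves it.
int last_negative_index(const int a[], std::size_t n)
{
    int index = -1;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (a[i] < 0)
            index = i;   // C4267: size_t narrowed to int
    }
    return index;
}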
The results of this code come from the combination of several factors:
signed vs unsigned: In C, why is "signed int" faster than "unsigned int"?
type size (int32 vs int64)
array index assembly code
vc++ optimization: /O2 (Maximize Speed)