I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating-point because "it's faster". This causes a whole lot of annoying problems and adds a lot of extra code.
Now, I remember reading about how floating-point calculations were so slow circa the 386 days, when (IIRC) the FPU was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating-point or integer calculation? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?
I know the correct answer is to benchmark on the target hardware; what would be a good way to test this? I wrote two tiny C++ programs and compared their run time with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs, etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?
The programs I used are as follows; they are by no means identical:

Program 1:
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    int accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += rand( ) % 365;
    }
    std::cout << accum << std::endl;

    return 0;
}
Program 2:
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    float accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += (float)( rand( ) % 365 );
    }
    std::cout << accum << std::endl;

    return 0;
}
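For example, would something like this be a more reliable harness? (A rough sketch using C++11's <chrono>: it times only the loop with a monotonic clock and keeps the best of several runs, instead of timing the whole process with "time". The loop body is taken from Program 1; note that rand() itself probably dominates it.)

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <ctime>

// Volatile sink keeps the result observable so the compiler
// cannot delete the loop entirely.
volatile int sink;

static void work()
{
    int accum = 0;
    // Note: rand() likely dominates this loop, just as in the
    // two programs above.
    for( unsigned int i = 0; i < 100000000; ++i )
        accum += rand( ) % 365;
    sink = accum;
}

int main()
{
    srand( time( NULL ) );

    double best = 1e300;
    for( int run = 0; run < 5; ++run )
    {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration< double >( t1 - t0 ).count();
        if( s < best ) best = s;  // best-of-N damps scheduler noise
    }
    std::printf( "best of 5: %f s\n", best );
    return 0;
}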
Thanks in advance!
Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.
Edit 2 (pasted from a comment below): We have an extensive code base currently. Really, I have come up against the generalization that we "must not use float since integer calculation is faster", and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.
Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).
In computer systems, integer arithmetic is exact, but its representable range of values is limited. Integer arithmetic is traditionally considered faster than floating-point arithmetic, though on modern x86 the difference is small for addition and multiplication, and integer division can actually be slower than floating-point division (as the timings further down show). Floating-point numbers represent what were called "real" numbers in school, i.e. numbers that can have a fractional part, such as 3.1415927.
Integers and floats are two different kinds of numeric data. An integer (more commonly called an int) is a whole number, with no fractional part. A float is a floating-point number, i.e. one that can carry a fractional part. Floats are used when that fractional precision is needed.
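A minimal illustration of the semantic difference (standard C++ only; the printed limits assume a typical x86 platform with 32-bit int and IEEE-754 float):

#include <iostream>
#include <limits>

int main()
{
    // Integer division truncates; floating-point division keeps the fraction.
    std::cout << 7 / 2       << "\n";   // prints 3
    std::cout << 7.0f / 2.0f << "\n";   // prints 3.5

    // Integers are exact but bounded; floats trade exactness for range.
    std::cout << std::numeric_limits<int>::max()   << "\n";   // 2147483647 on typical x86
    std::cout << std::numeric_limits<float>::max() << "\n";   // about 3.4e+38
    return 0;
}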
Floating-point numbers provide relative accuracy: over an enormous range, they can represent any number with a relative error that is at most a minuscule fraction, something like 0.0000000000001% (if you want to call that a percentage) for double precision.
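That worst-case relative spacing is what std::numeric_limits<T>::epsilon() reports; a quick demonstration (assumes IEEE-754 float and double, which is the case on x86):

#include <cstdio>
#include <limits>

int main()
{
    // epsilon() is the gap between 1.0 and the next representable value,
    // i.e. the worst-case relative spacing near 1.0.
    std::printf("float  epsilon: %g\n", (double)std::numeric_limits<float>::epsilon()); // ~1.19e-07
    std::printf("double epsilon: %g\n", std::numeric_limits<double>::epsilon());        // ~2.22e-16

    // Anything far below epsilon is absorbed entirely by the addition.
    float f = 1.0f;
    f += 1e-9f;  // much smaller than float epsilon
    std::printf("1.0f + 1e-9f == 1.0f ? %d\n", f == 1.0f);  // prints 1
    return 0;
}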
Floating point is about the same speed as integer arithmetic. The difference shows up when you use the result of a computation for memory address indexing: in that case the value has to take a rather slow path through the CPU for float-to-integer conversion before it can be used in addressing (see the sketch after the timings below).
For example (smaller numbers are faster):
64-bit Intel Xeon X5550 @ 2.67GHz, gcc 4.1.2 -O3
short add/sub: 1.005460 [0]
short mul/div: 3.926543 [0]
long add/sub: 0.000000 [0]
long mul/div: 7.378581 [0]
long long add/sub: 0.000000 [0]
long long mul/div: 7.378593 [0]
float add/sub: 0.993583 [0]
float mul/div: 1.821565 [0]
double add/sub: 0.993884 [0]
double mul/div: 1.988664 [0]
32-bit Dual Core AMD Opteron(tm) Processor 265 @ 1.81GHz, gcc 3.4.6 -O3
short add/sub: 0.553863 [0]
short mul/div: 12.509163 [0]
long add/sub: 0.556912 [0]
long mul/div: 12.748019 [0]
long long add/sub: 5.298999 [0]
long long mul/div: 20.461186 [0]
float add/sub: 2.688253 [0]
float mul/div: 4.683886 [0]
double add/sub: 2.700834 [0]
double mul/div: 4.646755 [0]
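To make the indexing point above concrete, here is a minimal sketch of the pattern that pays the conversion cost (the function names are just illustrative; the instruction note assumes x86 with SSE):

#include <cstddef>

float table[ 1024 ];

// Plain integer index: address arithmetic stays in the integer pipeline.
float lookup_int( int i )
{
    return table[ i & 1023 ];
}

// Index derived from a float: the cast compiles to a float-to-integer
// conversion (e.g. cvttss2si on x86/SSE) that must complete before the
// load's address can even be formed.
float lookup_float( float f )
{
    int i = (int)f;  // assumes f is non-negative and within int range
    return table[ i & 1023 ];
}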
As Dan pointed out, even once you normalize for clock frequency (which can itself be misleading in pipelined designs), results will vary wildly based on CPU architecture: individual ALU/FPU performance matters, as does the actual number of ALUs/FPUs available per core in superscalar designs, which determines how many independent operations can execute in parallel. The latter factor is not exercised by the code below, because all of its operations are sequentially dependent.
Poor man's FPU/ALU operation benchmark:
#include <stdio.h>
#ifdef _WIN32
#include <sys/timeb.h>
#else
#include <sys/time.h>
#endif
#include <time.h>
#include <cstdlib>

double mygettime(void) {
# ifdef _WIN32
  struct _timeb tb;
  _ftime(&tb);
  return (double)tb.time + (0.001 * (double)tb.millitm);
# else
  struct timeval tv;
  if(gettimeofday(&tv, 0) < 0) {
    perror("oops");
  }
  return (double)tv.tv_sec + (0.000001 * (double)tv.tv_usec);
# endif
}

template< typename Type >
void my_test(const char* name) {
  Type v = 0;
  // Do not use constants or repeating values
  // to avoid loop unroll optimizations.
  // All values >0 to avoid division by 0
  // Perform ten ops/iteration to reduce
  // impact of ++i below on measurements
  Type v0 = (Type)(rand() % 256)/16 + 1;
  Type v1 = (Type)(rand() % 256)/16 + 1;
  Type v2 = (Type)(rand() % 256)/16 + 1;
  Type v3 = (Type)(rand() % 256)/16 + 1;
  Type v4 = (Type)(rand() % 256)/16 + 1;
  Type v5 = (Type)(rand() % 256)/16 + 1;
  Type v6 = (Type)(rand() % 256)/16 + 1;
  Type v7 = (Type)(rand() % 256)/16 + 1;
  Type v8 = (Type)(rand() % 256)/16 + 1;
  Type v9 = (Type)(rand() % 256)/16 + 1;

  double t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v += v0;
    v -= v1;
    v += v2;
    v -= v3;
    v += v4;
    v -= v5;
    v += v6;
    v -= v7;
    v += v8;
    v -= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  // the loop completely
  printf("%s add/sub: %f [%d]\n", name, mygettime() - t1, (int)v&1);

  t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    v /= v0;
    v *= v1;
    v /= v2;
    v *= v3;
    v /= v4;
    v *= v5;
    v /= v6;
    v *= v7;
    v /= v8;
    v *= v9;
  }
  // Pretend we make use of v so compiler doesn't optimize out
  // the loop completely
  printf("%s mul/div: %f [%d]\n", name, mygettime() - t1, (int)v&1);
}

int main() {
  my_test< short >("short");
  my_test< long >("long");
  my_test< long long >("long long");
  my_test< float >("float");
  my_test< double >("double");

  return 0;
}
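To exercise that last factor, one could give the loop several independent dependency chains instead of a single one. Here is a hedged sketch (my_test_ilp is my own name for it, and it reuses mygettime() from the listing above) whose timing could be compared against the single-chain add/sub numbers:

// Variant of my_test with four independent add/sub chains, so a
// superscalar core can overlap operations from different chains.
template< typename Type >
void my_test_ilp(const char* name) {
  // Random operands, as above, to keep the compiler from
  // pre-computing the loop.
  Type v0 = (Type)(rand() % 256)/16 + 1;
  Type v1 = (Type)(rand() % 256)/16 + 1;
  Type v2 = (Type)(rand() % 256)/16 + 1;
  Type v3 = (Type)(rand() % 256)/16 + 1;
  Type a = v0, b = v1, c = v2, d = v3;

  double t1 = mygettime();
  for (size_t i = 0; i < 100000000; ++i) {
    a += v0; a -= v1;   // chain 1
    b += v1; b -= v2;   // chain 2 -- no chain reads
    c += v2; c -= v3;   // chain 3    another chain's result
    d += v3; d -= v0;   // chain 4
  }
  // Combine the accumulators so none of the chains is dead code
  printf("%s ilp add/sub: %f [%d]\n", name,
         mygettime() - t1, (int)(a + b + c + d) & 1);
}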