TL;DR: Why is multiplying/casting data in size_t slow, and why does this vary per platform?
I'm having some performance issues that I don't fully understand. The context is a camera frame grabber where a 128x128 uint16_t image is read and post-processed at a rate of several hundred Hz.
In the post-processing I generate a histogram `frame->histo`, which is of type `uint32_t` and has `thismaxval` = 2^16 elements; basically I tally all intensity values. Using this histogram I calculate the sum and squared sum:
```cpp
double sum = 0, sumsquared = 0;
size_t thismaxval = 1 << 16;

for (size_t i = 0; i < thismaxval; i++) {
    sum += (double)i * frame->histo[i];
    sumsquared += (double)(i * i) * frame->histo[i];
}
```
Profiling the code, I got the following (samples, percentage, code):
```
58228   32.1263 :  sum += (double)i * frame->histo[i];
116760  64.4204 :  sumsquared += (double)(i * i) * frame->histo[i];
```
That is, the first line takes up 32% of the CPU time, the second line 64%.
I did some benchmarking and it seems to be the datatype/casting that's problematic. When I change the code to

```cpp
uint_fast64_t isum = 0, isumsquared = 0;

for (uint_fast32_t i = 0; i < thismaxval; i++) {
    isum += i * frame->histo[i];
    isumsquared += (i * i) * frame->histo[i];
}
```
it runs about 10x faster. However, this performance hit also varies per platform. On the workstation, a Core i7 950 @ 3.07 GHz, the code is 10x faster. On my MacBook8,1, which has an Intel Core i7 Sandy Bridge 2.7 GHz (2620M), the code is only 2x faster.
Now I am wondering: why is this casting so slow, and why does it vary per platform?
Update:
I compiled the above code with

```
g++ -O3 -Wall cast_test.cc -o cast_test
```
Update2:
I ran the optimized code through a profiler (Instruments on Mac, similar to Shark) and found two things:

1) The looping itself takes a considerable amount of time in some cases. `thismaxval` is of type `size_t`.

   - `for(size_t i = 0; i < thismaxval; i++)` takes 17% of my total runtime
   - `for(uint_fast32_t i = 0; i < thismaxval; i++)` takes 3.5%
   - `for(int i = 0; i < thismaxval; i++)` does not show up in the profiler; I assume it's less than 0.1%

2) The datatypes and casting matter as follows:

   - `sumsquared += (double)(i * i) * histo[i];` takes 15% (with `size_t i`)
   - `sumsquared += (double)(i * i) * histo[i];` takes 36% (with `uint_fast32_t i`)
   - `isumsquared += (i * i) * histo[i];` takes 13% (with `uint_fast32_t i`, `uint_fast64_t isumsquared`)
   - `isumsquared += (i * i) * histo[i];` takes 11% (with `int i`, `uint_fast64_t isumsquared`)

Surprisingly, `int` is faster than `uint_fast32_t`?
Update4:
I ran some more tests with different datatypes and different compilers, on one machine. The results are as follows. For tests 0--2 the relevant code is
```cpp
for(loop_t i = 0; i < thismaxval; i++)
    sumsquared += (double)(i * i) * histo[i];
```
with `sumsquared` a double, and `loop_t` being `size_t`, `uint_fast32_t` and `int` for tests 0, 1 and 2.
For tests 3--5 the code is

```cpp
for(loop_t i = 0; i < thismaxval; i++)
    isumsquared += (i * i) * histo[i];
```
with `isumsquared` of type `uint_fast64_t` and `loop_t` again `size_t`, `uint_fast32_t` and `int` for tests 3, 4 and 5.
The compilers I used are gcc 4.2.1, 4.4.7, 4.6.3 and 4.7.0. The timings are percentages of the total CPU time of the code, so they show relative, not absolute, performance (although the runtime was quite constant at 21 s). The CPU time is given for both lines together, because I'm not quite sure the profiler correctly separated the two lines of code.
```
gcc:      4.2.1   4.4.7   4.6.3   4.7.0
----------------------------------------
test 0:   21.85   25.15   22.05   21.85
test 1:   21.9    25.05   22      22
test 2:   26.35   25.1    21.95   19.2
test 3:    7.15    8.35   18.55   19.95
test 4:   11.1     8.45    7.35    7.1
test 5:    7.1     7.8     6.9     7.05
```
Based on this, it seems that casting is expensive, regardless of what integer type I use.
Also, it seems gcc 4.6 and 4.7 are not able to optimize test 3 (a `size_t` loop variable with a `uint_fast64_t` accumulator) properly.
Converting to double will not cause an overflow, but it can result in a loss of precision for very large size_t values. Either way, it doesn't make a lot of sense to convert a size_t to a double; you're still better off keeping the value in a size_t variable.
You can safely convert a size_t to an int by checking its range before casting: test against INT_MAX to rule out any possible signed integer overflow. If the value is outside the valid int range, the conversion simply isn't possible.
Since size_t must be an unsigned type, you first need to check explicitly whether your float is negative: the behaviour of converting a float that is less than or equal to -1 to an unsigned type is undefined. The second job is to check that the float is within the upper range of size_t.
For your original questions:
For the additional questions:
In general, try to avoid visible and hidden casts as much as possible when they aren't really necessary. For example, find out what real datatype is hidden behind size_t in your environment (gcc) and use that one for the loop variable. In your example, the square of unsigned ints can never be a fractional value, so there is no point in using double here. Stick to integer types to achieve maximum performance.