
Cast performance from size_t to double

TL;DR: Why is multiplying/casting data in size_t slow and why does this vary per platform?

I'm having some performance issues that I don't fully understand. The context is a camera frame grabber where a 128x128 uint16_t image is read and post-processed at a rate of several 100 Hz.

In the post-processing I generate a histogram frame->histo, which is of type uint32_t and has thismaxval = 2^16 elements; basically I tally all intensity values. Using this histogram I calculate the sum and squared sum:

    double sum = 0, sumsquared = 0;
    size_t thismaxval = 1 << 16;

    for(size_t i = 0; i < thismaxval; i++) {
        sum += (double)i * frame->histo[i];
        sumsquared += (double)(i * i) * frame->histo[i];
    }

Profiling the code, I got the following (samples, percentage, code):

     58228 32.1263 :  sum += (double)i * frame->histo[i];
    116760 64.4204 :  sumsquared += (double)(i * i) * frame->histo[i];

i.e., the first line takes up 32% of CPU time and the second line 64%.

I did some benchmarking and it seems to be the datatype/casting that's problematic. When I change the code to

    uint_fast64_t isum = 0, isumsquared = 0;

    for(uint_fast32_t i = 0; i < thismaxval; i++) {
        isum += i * frame->histo[i];
        isumsquared += (i * i) * frame->histo[i];
    }

it runs ~10x faster. However, this performance difference also varies per platform. On my workstation, an Intel Core i7 950 @ 3.07 GHz, the code is 10x faster. On my MacBook8,1, which has an Intel Core i7 2620M (Sandy Bridge) @ 2.7 GHz, the code is only 2x faster.

Now I am wondering:

  1. Why is the original code so slow and easily sped up?
  2. Why does this vary per platform so much?

Update:

I compiled the above code with

g++ -O3  -Wall cast_test.cc -o cast_test 

Update2:

I ran the optimized codes through a profiler (Instruments on Mac, like Shark) and found two things:

1) The looping itself takes a considerable amount of time in some cases. thismaxval is of type size_t.

  1. for(size_t i = 0; i < thismaxval; i++) takes 17% of my total runtime
  2. for(uint_fast32_t i = 0; i < thismaxval; i++) takes 3.5%
  3. for(int i = 0; i < thismaxval; i++) does not show up in the profiler; I assume it's less than 0.1%

2) The datatypes and casting matter as follows:

  1. sumsquared += (double)(i * i) * histo[i]; 15% (with size_t i)
  2. sumsquared += (double)(i * i) * histo[i]; 36% (with uint_fast32_t i)
  3. isumsquared += (i * i) * histo[i]; 13% (with uint_fast32_t i, uint_fast64_t isumsquared)
  4. isumsquared += (i * i) * histo[i]; 11% (with int i, uint_fast64_t isumsquared)

Surprisingly, int is faster than uint_fast32_t?

Update4:

I ran some more tests with different datatypes and different compilers, on one machine. The results are as follows.

For tests 0--2 the relevant code is

    for(loop_t i = 0; i < thismaxval; i++)
        sumsquared += (double)(i * i) * histo[i];

with sumsquared a double, and loop_t size_t, uint_fast32_t and int for tests 0, 1 and 2.

For tests 3--5 the code is

    for(loop_t i = 0; i < thismaxval; i++)
        isumsquared += (i * i) * histo[i];

with isumsquared of type uint_fast64_t and loop_t again size_t, uint_fast32_t and int for tests 3, 4 and 5.

The compilers I used are gcc 4.2.1, gcc 4.4.7, gcc 4.6.3 and gcc 4.7.0. The timings are percentages of total cpu time of the code, so they show relative performance, not absolute (although the runtime was quite constant at 21 s). The cpu time is given for both lines together, because I'm not sure the profiler correctly separated the two lines of code.

    gcc:    4.2.1  4.4.7  4.6.3  4.7.0
    ----------------------------------
    test 0: 21.85  25.15  22.05  21.85
    test 1: 21.9   25.05  22     22
    test 2: 26.35  25.1   21.95  19.2
    test 3: 7.15   8.35   18.55  19.95
    test 4: 11.1   8.45   7.35   7.1
    test 5: 7.1    7.8    6.9    7.05

or, as a chart:

(Figure: casting performance)

Based on this, it seems that casting is expensive, regardless of what integer type I use.

Also, it seems gcc 4.6 and 4.7 are not able to optimize loop 3 (size_t and uint_fast64_t) properly.

Tim asked May 15 '12 07:05


1 Answer

For your original questions:

  1. The code is slow because it involves converting from integer to floating-point types. That's why it's easily sped up: when you also use an integer datatype for the sum variables, no float conversion is needed anymore.
  2. The difference is the result of several factors. For example, it depends on how efficiently a platform can perform an int->float conversion. Furthermore, this conversion can interfere with processor-internal optimizations such as program flow and branch prediction, caches, and so on, and the internal parallelizing features of the processor can also have a huge influence on such calculations.

For the additional questions:

  • "Surprisingly, int is faster than uint_fast32_t"? What are sizeof(size_t) and sizeof(int) on your platform? One guess is that both size_t and uint_fast32_t are 64-bit on your system, so casting down to a 32-bit type not only can give you calculation errors but also incurs a different-size-casting penalty.

In general, try to avoid visible and hidden casts as much as possible if they aren't really necessary. For example, try to find out what real datatype is hidden behind size_t in your environment (gcc) and use that one for the loop variable. In your example, the square of an unsigned int can never have a fractional part, so there is no reason to use double here. Stick to integer types to achieve maximum performance.

Kosi2801 answered Oct 11 '22 03:10