TL;DR: Why is multiplying/casting data in size_t slow, and why does this vary per platform?
I'm having some performance issues that I don't fully understand. The context is a camera frame grabber where a 128x128 uint16_t image is read and post-processed at a rate of several hundred Hz.
In the post-processing I generate a histogram `frame->histo`, which is of type `uint32_t` and has `thismaxval` = 2^16 elements; basically I tally all intensity values. Using this histogram I calculate the sum and squared sum:
```cpp
double sum = 0, sumsquared = 0;
size_t thismaxval = 1 << 16;

for (size_t i = 0; i < thismaxval; i++) {
    sum += (double)i * frame->histo[i];
    sumsquared += (double)(i * i) * frame->histo[i];
}
```
Profiling the code, I got the following (samples, percentage, code):
```
58228   32.1263 :  sum += (double)i * frame->histo[i];
116760  64.4204 :  sumsquared += (double)(i * i) * frame->histo[i];
```
That is, the first line takes up 32% of the CPU time, the second line 64%.
I did some benchmarking and it seems to be the datatype/casting that's problematic. When I change the code to

```cpp
uint_fast64_t isum = 0, isumsquared = 0;

for (uint_fast32_t i = 0; i < thismaxval; i++) {
    isum += i * frame->histo[i];
    isumsquared += (i * i) * frame->histo[i];
}
```
it runs about 10x faster. However, this performance hit also varies per platform. On the workstation, a Core i7 950 @ 3.07 GHz, the code is 10x faster. On my MacBook8,1, which has an Intel Core i7 Sandy Bridge 2.7 GHz (2620M), the code is only 2x faster.
Now I am wondering: why is this casting so slow, and why does it vary per platform?
Update:
I compiled the above code with

```
g++ -O3 -Wall cast_test.cc -o cast_test
```
Update2:
I ran the optimized code through a profiler (Instruments on Mac, similar to Shark) and found two things:

1) The looping itself takes a considerable amount of time in some cases. `thismaxval` is of type `size_t`.

   - `for(size_t i = 0; i < thismaxval; i++)` takes 17% of my total runtime
   - `for(uint_fast32_t i = 0; i < thismaxval; i++)` takes 3.5%
   - `for(int i = 0; i < thismaxval; i++)` does not show up in the profiler; I assume it's less than 0.1%

2) The datatypes and casting matter as follows:

   - `sumsquared += (double)(i * i) * histo[i];` takes 15% (with `size_t i`)
   - `sumsquared += (double)(i * i) * histo[i];` takes 36% (with `uint_fast32_t i`)
   - `isumsquared += (i * i) * histo[i];` takes 13% (with `uint_fast32_t i`, `uint_fast64_t isumsquared`)
   - `isumsquared += (i * i) * histo[i];` takes 11% (with `int i`, `uint_fast64_t isumsquared`)

Surprisingly, `int` is faster than `uint_fast32_t`?
Update4:
I ran some more tests with different datatypes and different compilers, on one machine. The results are as follows. For tests 0--2 the relevant code is
```cpp
for(loop_t i = 0; i < thismaxval; i++)
    sumsquared += (double)(i * i) * histo[i];
```
with `sumsquared` a double, and `loop_t` being `size_t`, `uint_fast32_t` and `int` for tests 0, 1 and 2.
For tests 3--5 the code is

```cpp
for(loop_t i = 0; i < thismaxval; i++)
    isumsquared += (i * i) * histo[i];
```
with `isumsquared` of type `uint_fast64_t` and `loop_t` again `size_t`, `uint_fast32_t` and `int` for tests 3, 4 and 5.
The compilers I used are gcc 4.2.1, 4.4.7, 4.6.3 and 4.7.0. The timings are percentages of the total CPU time of the code, so they show relative, not absolute, performance (although the runtime was quite constant at 21 s). The CPU time is given for both lines together, because I'm not quite sure the profiler correctly separated the two lines of code.
```
gcc:      4.2.1   4.4.7   4.6.3   4.7.0
----------------------------------------
test 0:   21.85   25.15   22.05   21.85
test 1:   21.9    25.05   22      22
test 2:   26.35   25.1    21.95   19.2
test 3:    7.15    8.35   18.55   19.95
test 4:   11.1     8.45    7.35    7.1
test 5:    7.1     7.8     6.9     7.05
```
Based on this, it seems that casting is expensive, regardless of what integer type I use.
Also, it seems gcc 4.6 and 4.7 are not able to optimize test 3 (a `size_t` loop variable with a `uint_fast64_t` accumulator) properly.
Converting to double will not cause an overflow, but it can result in a loss of precision for very large size_t values. Either way, it doesn't make a lot of sense to convert a size_t to a double; you're still better off keeping the value in a size_t variable.
You can safely convert a size_t to an int by checking its range before casting: test against INT_MAX to rule out any possible signed integer overflow. If the value is outside the valid int range, the conversion simply isn't possible.
Since size_t must be an unsigned type, you first need to check explicitly whether your float is negative: the behaviour of converting a float that is less than or equal to -1 to an unsigned type is undefined. The second job is to check that the float is within the upper range of size_t.
For your original questions:
For the additional questions:
In general, try to avoid visible and hidden casts as much as possible when they aren't really necessary. For example, find out what real datatype is hidden behind size_t in your environment (gcc) and use that one for the loop variable. In your example, the square of unsigned ints can never be a fractional value, so there is no point in using double here. Stick to integer types to achieve maximum performance.