Speedup C++ code

Tags: c++, optimization

I am writing a C++ number-crunching application, where the bottleneck is a function that computes the following for double:


 template<class T> inline T sqr(const T& x){return x*x;}

and another one that calculates


Base dist2(const Point& p) const { return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }

These operations take 80% of the computation time. Can you suggest approaches to make them faster, even at the cost of some accuracy?

Thanks


asked May 11 '10 14:05

Open the way

2 Answers

First, make sure dist2 can be inlined (it's not clear from your post whether this is the case). In general that means defining it in a header file, although if your compiler generates code at link time it can inline across translation units and the header definition isn't strictly necessary.
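As a rough sketch of the header-only layout, assuming Base is plain double and using a hypothetical header name point.h (neither is confirmed by the question), defining dist2 inside the class body makes it implicitly inline and visible to every caller:

```cpp
// point.h (hypothetical file name; include it wherever dist2 is called)
#ifndef POINT_H
#define POINT_H

#include <cassert>  // only needed for the usage check mentioned after the block

template <class T>
inline T sqr(const T& x) { return x * x; }

// Assumption: the coordinates are plain doubles (the question's Base is unknown).
struct Point {
    double x, y, z;

    // Defined in the class body, so it is implicitly inline and its
    // definition is visible in every translation unit that includes this header.
    double dist2(const Point& p) const {
        return sqr(x - p.x) + sqr(y - p.y) + sqr(z - p.z);
    }
};

#endif  // POINT_H
```

With this in place, a call such as a.dist2(b) can be expanded at the call site when optimization is on. Whether the compiler actually inlines it is still its own decision, so it is worth checking the generated code or a profile.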

Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of a SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to use them, you can try to batch your sqr operations together. With 128-bit SSE registers, a single instruction can operate on up to 4 floats or 2 doubles at a time, but of course only if the inputs to more than one operation are ready. I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them, but you can at least set up your code so that it is possible in theory.
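One way to set the code up for batching is to store each coordinate in its own contiguous array and compute many distances in one loop. This is only a sketch: the Points layout and the dist2_batch name are made up for illustration, and whether the loop actually gets vectorized depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays layout: one contiguous array per coordinate,
// so neighbouring iterations read neighbouring memory.
struct Points {
    std::vector<double> x, y, z;
};

// Squared distance from the query point (qx, qy, qz) to every stored point.
// Results are written to out, which is resized to match.
void dist2_batch(const Points& pts, double qx, double qy, double qz,
                 std::vector<double>& out) {
    const std::size_t n = pts.x.size();
    assert(pts.y.size() == n && pts.z.size() == n);
    out.resize(n);

    const double* px = pts.x.data();
    const double* py = pts.y.data();
    const double* pz = pts.z.data();
    double* o = out.data();

    // Simple, branch-free loop body: a good candidate for auto-vectorization.
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = px[i] - qx;
        const double dy = py[i] - qy;
        const double dz = pz[i] - qz;
        o[i] = dx * dx + dy * dy + dz * dz;
    }
}
```

Most compilers can report which loops they vectorized, which is a quick way to see whether the batching paid off.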

If you're still not satisfied with the speed and you don't trust that your compiler is doing its best, look into compiler intrinsics, which let you write the parallel instructions explicitly. Alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it a large batch operation so the call overhead is amortized.)
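For the intrinsics route, a minimal sketch using the SSE2 intrinsics from emmintrin.h could look like the following. The function name, the separate coordinate arrays and the EXAMPLE_HAVE_SSE2 macro are assumptions for illustration; the code falls back to a scalar loop when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define EXAMPLE_HAVE_SSE2 1
#endif

// Squared distance from (qx, qy, qz) to n points stored as separate
// coordinate arrays. Two doubles are processed per iteration with SSE2.
void dist2_batch_sse2(const double* px, const double* py, const double* pz,
                      double qx, double qy, double qz,
                      double* out, std::size_t n) {
    assert(n == 0 || (px && py && pz && out));
    std::size_t i = 0;

#ifdef EXAMPLE_HAVE_SSE2
    const __m128d vqx = _mm_set1_pd(qx);  // broadcast the query coordinates
    const __m128d vqy = _mm_set1_pd(qy);
    const __m128d vqz = _mm_set1_pd(qz);
    for (; i + 2 <= n; i += 2) {
        // Unaligned loads, so the arrays need no special alignment.
        const __m128d dx = _mm_sub_pd(_mm_loadu_pd(px + i), vqx);
        const __m128d dy = _mm_sub_pd(_mm_loadu_pd(py + i), vqy);
        const __m128d dz = _mm_sub_pd(_mm_loadu_pd(pz + i), vqz);
        const __m128d sum = _mm_add_pd(
            _mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy)),
            _mm_mul_pd(dz, dz));
        _mm_storeu_pd(out + i, sum);
    }
#endif

    // Scalar tail (and full fallback on targets without SSE2).
    for (; i < n; ++i) {
        const double dx = px[i] - qx;
        const double dy = py[i] - qy;
        const double dz = pz[i] - qz;
        out[i] = dx * dx + dy * dy + dz * dz;
    }
}
```

Because it handles a whole batch per call, it matters much less whether this function itself gets inlined.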

To take it even further (as glowcoder has already mentioned), you could perform these operations on a GPU. For your specific case, bear in mind that GPUs often don't support double-precision floating point (or support it only at reduced speed). If the problem is a good fit for a GPU, though, you may get orders of magnitude better performance this way. Search for GPGPU and see what's best for you.

answered Sep 23 '22 06:09

guesser

What is Base?

Is it a class with a non-explicit constructor? It's possible that you're creating a fair number of temporary Base objects. That could be a big CPU hog.


template<class T> inline T sqr(const T& x){return x*x;}

Base dist2(const Point& p) const {
  return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}

If p's member variables are of type Base, you could be calling sqr on Base objects, which creates temporaries for the subtracted coordinates, inside sqr, and then for each added component.

(We can't tell without the class definitions.)

You could probably speed it up by forcing the sqr calls to operate on primitives and not using Base until you get to the return type of dist2.
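To illustrate doing the arithmetic on primitives, here is a sketch that assumes Base is a thin wrapper around double with a non-explicit constructor and a value() accessor. Both the class shape and the accessor name are guesses, since the real definitions aren't in the question.

```cpp
#include <cassert>  // only needed for the usage check mentioned after the block

// Hypothetical stand-in for the question's Base: a wrapper around double
// with a non-explicit constructor, as speculated above.
class Base {
public:
    Base(double v = 0.0) : v_(v) {}  // non-explicit on purpose
    double value() const { return v_; }
private:
    double v_;
};

struct Point {
    Base x, y, z;

    // All arithmetic is done on plain doubles; a single Base object is
    // constructed only for the return value.
    Base dist2(const Point& p) const {
        const double dx = x.value() - p.x.value();
        const double dy = y.value() - p.y.value();
        const double dz = z.value() - p.z.value();
        return Base(dx * dx + dy * dy + dz * dz);
    }
};
```

If Base really is this trivial, an optimizing compiler may already remove the temporaries, so it is worth measuring before and after the change.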

Other performance improvement opportunities are to:

- Use non-floating point operations, if you're OK with less precision.
- Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
- (This is probably obvious, but) make sure you're compiling with optimization turned on.
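As an illustration of the non-floating-point option, here is a fixed-point sketch. The 1/1024 resolution, the type and the function names are all assumptions for the example. On current hardware integer arithmetic is not automatically faster than double arithmetic, so treat this as something to measure rather than a guaranteed win.

```cpp
#include <cassert>
#include <cstdint>

// Assumed resolution: coordinates are stored in units of 1/1024.
constexpr std::int32_t kScale = 1024;

// Hypothetical fixed-point point type.
struct FixedPoint {
    std::int32_t x, y, z;
};

// Convert a double to fixed point, rounding to the nearest unit.
inline std::int32_t to_fixed(double v) {
    return static_cast<std::int32_t>(v * kScale + (v >= 0.0 ? 0.5 : -0.5));
}

// Squared distance in units of (1/kScale)^2, accumulated in 64 bits.
// Assumes each coordinate difference stays below 2^30 units in magnitude,
// so the sum of three squares fits in a signed 64-bit integer.
inline std::int64_t dist2_fixed(const FixedPoint& a, const FixedPoint& b) {
    const std::int64_t limit = std::int64_t(1) << 30;
    const std::int64_t dx = std::int64_t(a.x) - b.x;
    const std::int64_t dy = std::int64_t(a.y) - b.y;
    const std::int64_t dz = std::int64_t(a.z) - b.z;
    assert(dx > -limit && dx < limit);
    assert(dy > -limit && dy < limit);
    assert(dz > -limit && dz < limit);
    return dx * dx + dy * dy + dz * dz;
}
```

Comparing two distances works directly on the integer results; converting back to real units means dividing by kScale squared.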

answered Sep 21 '22 06:09

Stephen
