Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speedup C++ code

I am writing a C++ number crunching application, where the bottleneck is a function that has to calculate for double:

 template<class T> inline T sqr(const T& x){return x*x;} 

and another one that calculates

Base   dist2(const Point& p) const        { return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); } 

These operations take 80% of the computation time. I wonder if you can suggest approaches to make it faster, even if there is some sort of accuracy loss

Thanks

like image 867
Open the way Avatar asked May 11 '10 14:05

Open the way


People also ask

What is Optimisation in C?

Optimization is a program transformation technique, which tries to improve the code by making it consume less resources (i.e. CPU, Memory) and deliver high speed. In optimization, high-level general programming constructs are replaced by very efficient low-level programming codes.

Which is faster C or C+?

Performance-based on Nature Of Language C++ language is an object-oriented programming language, and it supports some important features like Polymorphism, Abstract Data Types, Encapsulation, etc. Since it supports object-orientation, speed is faster compared to the C language.

How do I optimize my program performance?

The first step in optimizing a program is to eliminate unnecessary work, making the code perform its in- tended task as efficiently as possible. This includes eliminating unnecessary function calls, conditional tests, and memory references.


2 Answers

First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).

Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of an SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions should be able to do up to 4 float or 2 double operations at a time depending on the instruction.. but of course it can only do this if you have the inputs to more than one operation on the ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them.. but you can at least set up your code so that it would be possible in theory.

If you're still not satisfied with the speed and you don't trust that your compiler is doing it best, you should look into using compiler intrinsics which will allow you to write potential parallel instructions explicitly.. or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation)

To take it even further, (and as glowcoder has already mentioned) you could perform these operations on a GPU. For your specific case, bear in mind that GPU's often don't support double precision floating point.. though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.

like image 60
guesser Avatar answered Sep 23 '22 06:09

guesser


What is Base?

Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.

template<class T> inline T sqr(const T& x){return x*x;} Base   dist2(const Point& p) const {   return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); } 

If p's member variables are of type Base, you could be calling sqr on Base objects, which will be creating temporaries for the subtracted coordinates, in sqr, and then for each added component.

(We can't tell without the class definitions)

You could probably speed it up by forcing the sqr calls to be on primitves and not using Base until you get to the return type of dist2.

Other performance improvement opportunities are to:

  • Use non-floating point operations, if you're ok with less precision.
  • Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
  • (this is probably obvious, but) Make sure you're compiling with optimization turned on.
like image 25
Stephen Avatar answered Sep 21 '22 06:09

Stephen