Can my loop be optimized any more?

Question

Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?

I'm not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.

Any help is appreciated!

Edit: I'm running in Windows 7 64-bit with Visual Studio 2008 edition on a Core 2 Duo T5850 (2.16 GHz)

typedef double real;  struct Particle {     Vector pos, vel, acc, jerk;     Vector oldPos, oldVel, oldAcc, oldJerk;     real mass; };  class Vector { private:     real vec[3];  public:     // Operators defined here };  real Gravity::interact(Particle *p, size_t numParticles) {     PROFILE_FUNC();     real tau_q = 1e300;      for (size_t i = 0; i < numParticles; i++)     {         p[i].jerk = 0;         p[i].acc = 0;     }      for (size_t i = 0; i < numParticles; i++)     {         for (size_t j = i+1; j < numParticles; j++)         {             Vector r = p[j].pos - p[i].pos;             Vector v = p[j].vel - p[i].vel;             real r2 = lengthsq(r);             real v2 = lengthsq(v);              // Calculate inverse of |r|^3             real r3i = Constants::G * pow(r2, -1.5);              // da = r / |r|^3             // dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5             Vector da = r * r3i;             Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;              // Calculate new acceleration and jerk             p[i].acc += da * p[j].mass;             p[i].jerk += dj * p[j].mass;             p[j].acc -= da * p[i].mass;             p[j].jerk -= dj * p[i].mass;              // Collision estimation             // Metric 1) tau = |r|^2 / |a(j) - a(i)|             // Metric 2) tau = |r|^4 / |v|^4             real mij = p[i].mass + p[j].mass;             real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);             real tau_est_q2 = (r2*r2) / (v2*v2);              if (tau_est_q1 < tau_q)                 tau_q = tau_est_q1;             if (tau_est_q2 < tau_q)                 tau_q = tau_est_q2;         }     }      return sqrt(sqrt(tau_q)); }

8 revs · Accepted Answer

Inline the calls to lengthsq().
Change pow(r2,-1.5) to 1/(r2*sqrt(r2)) to lower the cost of the computing r^1.5
Use scalars (p_i_acc, etc.) inside the innner most loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily.

4a. Try replacing the if (...) tau_q = with

    tau_q=minimum(...,...)

Many compilers recognize the mininum function as one they can do with predicated operations rather than real branches, avoiding pipeline flushes.

4b. [EDIT to split 4a and 4b apart] You might consider storing tau_..q2 instead as tau_q, and comparing against r2/v2 rather than r2*r2/v2*v2. Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau..q2 at the end. To do this, collect minimums of tau_q1 and tau_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop]

[EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.

Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it], and c) going multicore (using OpenMP).

Can my loop be optimized any more?

Tags:

Mike Bailey

1 Answers

8 revs

Recent Activity

Donate For Us

Can my loop be optimized any more?

Tags:

Mike Bailey

1 Answers

8 revs

Related questions

Recent Activity

Donate For Us