 

Most performant calculation of Newtonian forces in numpy/scipy

For a university exercise, we had to implement a leapfrog integrator with exact Newtonian forces in Python. The course is over and our solutions were all more than good enough, but I wondered if and how one could improve the performance of the force calculation even more.

The bottleneck is calculating all the forces (i.e. accelerations):

a<sub>i</sub> = Σ<sub>j≠i</sub> Gm<sub>j</sub> (r<sub>j</sub> − r<sub>i</sub>) / |r<sub>j</sub> − r<sub>i</sub>|<sup>3</sup>

for a large number of particles N (1000 and more), with i, j < N.

Here r<sub>i</sub> and r<sub>j</sub> are the 3-dimensional position vectors of the particles, stored in an ndarray of shape (N,3), and Gm is the particle mass times the gravitational constant, which I saved in an ndarray of shape (N).
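
Written out directly, this is just a double loop over all particles. For reference, here is a minimal (and slow) sketch of that, assuming r of shape (N,3) and Gm of shape (N,) as described above (the function name is made up):

import numpy as np

def a_reference(r, Gm):
    """Direct double-loop evaluation of the formula above (slow, for reference only)."""
    N = len(r)
    a = np.zeros((N, 3))
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            d = r[j] - r[i]  # r_j - r_i
            a[i] += Gm[j] * d / np.linalg.norm(d)**3
    return a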

The fastest version I found so far is the following:

def a(self):
    # needs: from scipy.spatial.distance import cdist
    sep = self.r[np.newaxis, :] - self.r[:, np.newaxis]
    # Calculate the distances between all particles with cdist;
    # this is much faster than doing it by hand
    dists = cdist(self.r, self.r)
    scale = dists * dists * dists
    # set the diagonal elements of scale to something != 0, to avoid division by 0
    np.fill_diagonal(scale, 1)
    Fsum = (sep / scale.reshape(self.particlenr, self.particlenr, 1)) * self.Gm[:, None]
    return np.add.reduce(Fsum, axis=1)

But it bugs me that this is probably not the fastest version. The first line seems to be too slow compared to cdist, which is doing essentially the same calculation. Also, this solution does not use the symmetry of swapping r<sub>i</sub> and r<sub>j</sub> in the problem and therefore calculates all elements twice.
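
For illustration, one way that pair symmetry could in principle be exploited is to evaluate every pair only once via the upper triangle; an untested sketch (same r and Gm as above, function name made up):

def a_pairs(r, Gm):
    """Sketch: evaluate each pair once and add both contributions."""
    N = len(r)
    i_idx, j_idx = np.triu_indices(N, k=1)           # all pairs with i < j
    d = r[j_idx] - r[i_idx]                          # r_j - r_i for each pair
    inv_d3 = 1.0 / np.einsum('ij,ij->i', d, d)**1.5  # 1 / |r_j - r_i|**3
    a = np.zeros((N, 3))
    np.add.at(a, i_idx,  (Gm[j_idx] * inv_d3)[:, None] * d)  # contribution of j on i
    np.add.at(a, j_idx, -(Gm[i_idx] * inv_d3)[:, None] * d)  # contribution of i on j
    return a

Whether something like this is actually faster is not obvious, since np.add.at and the pair indexing bring their own overhead.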

Do you know of any performance improvements (without changing the force calculation to some approximation or switching programming languages)?

asked Feb 17 '15 by Nudin

1 Answer

I'll give it a try: I implemented a routine that determines a single a_i:

import numpy as np

GM = .01  # particle mass times the gravitational constant

def calc_a_i(rr, i):
    """ Calculate one a_i """
    drr = rr - rr[i, :] # r_j - r_i
    dr3 = np.linalg.norm(drr, axis=1)**3  # |r_j - r_i|**3
    dr3[i] = 1  # case i==j: drr = [0, 0, 0]
    # this would be more robust (eliminate small denominators):
    #dr3 = np.where(np.abs(dr3) > 1e-12, dr3, 1)
    return np.sum(drr.T/dr3, axis=1)

n = 4000 # number of particles
rr = np.random.randn(n, 3) # generate some particles

# Calculate each a_i separately:
aa = np.array([calc_a_i(rr, i) for i in range(n)]) * GM # all a_i

To test it, I ran:

In [1]: %timeit aa = np.array([calc_a_i(rr, i) for i in range(n)])
1 loops, best of 3: 2.93 s per loop

The easiest way to speed up such code is to use numexpr for faster evaluation of array expressions:

import numexpr as ne
ne.set_num_threads(1)  # multithreading causes too much overhead

def ne_calc_a_i(i):
    """ Use numexpr - here rr is global for easier parallelization"""
    dr1, dr2, dr3 = (rr - rr[i, :]).T # r_j - r_i
    drrp3 = ne.evaluate("sqrt(dr1**2 + dr2**2 + dr3**2)**3")
    drrp3[i] = 1
    return np.sum(np.vstack([dr1, dr2, dr3])/drrp3, axis=1)

# Calculate each a_i separately:
aa_ne = np.array([ne_calc_a_i(i) for i in range(n)]) * GM  # all a_i    

This improves the speed by a factor of 2:

In [2]: %timeit aa_ne = np.array([ne_calc_a_i(i) for i in range(n)])
1 loops, best of 3: 1.29 s per loop
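
Both variants compute the same sums, so as a quick sanity check (assuming aa and aa_ne from the snippets above are still in scope) the results should agree up to floating-point rounding:

np.allclose(aa, aa_ne)  # should return True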

To speed up the code further, let's run it on an IPython cluster:

# Start local cluster with 4 clients in a shell with: 
# ipcluster start -n 4

from IPython.parallel import Client  # in newer setups: from ipyparallel import Client

rc = Client()  # connect to the cluster
dview = rc[:]  # direct view of all engines

dview.execute("import numpy as np")  # import libraries on clients
dview.execute("import numexpr as ne")
dview.execute("ne.set_num_threads(1)")

def para_calc_a(dview, rr):
    """ Only in function for %timeit """
    # send rr and ne_calc_a_i() to clients:
    dview.push(dict(rr=rr, ne_calc_a_i=ne_calc_a_i), block=True)
    return np.array(dview.map_sync(ne_calc_a_i, range(n)))*GM

Compared with the first version, the speedup is more than a factor of four:

In [3]: %timeit aa_p = para_calc_a(dview, rr)
1 loops, best of 3: 612 ms per loop

As @mathdan already noted, it is not obvious how to optimize such a problem: whether the memory bus or the floating-point unit is the limiting factor depends on your CPU architecture, and the two cases call for different techniques.

For further gains you might want to look at Theano: it can dynamically generate GPU code from Python.
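
As a rough, untested sketch (the symbolic variable and function names here are made up), the fully vectorized expression could be written in Theano roughly as follows; whether it actually runs on the GPU depends on how Theano is configured:

import theano
import theano.tensor as T

r = T.dmatrix('r')      # (N, 3) particle positions
gm = T.dvector('gm')    # (N,) G * m_j

# sep[i, j] = r_j - r_i, shape (N, N, 3)
sep = r.dimshuffle('x', 0, 1) - r.dimshuffle(0, 'x', 1)
d3 = T.sqrt(T.sum(sep ** 2, axis=2)) ** 3
d3 = d3 + T.identity_like(d3)   # put 1 on the diagonal to avoid 0/0
acc = T.sum(sep / d3.dimshuffle(0, 1, 'x') * gm.dimshuffle('x', 0, 'x'), axis=1)

a_fn = theano.function([r, gm], acc)  # compiled (to GPU code if Theano is set up for it)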

answered Oct 04 '22 by Dietrich