Running scipy.integrate.ode in multiprocessing Pool results in huge performance hit

Question

I'm using python's scipy.integrate to simulate a 29-dimensional linear system of differential equations. Since I need to solve several problem instances, I thought I could speed it up by doing computations in parallel using multiprocessing.Pool. Since there is no shared data or synchronization necessary between threads (the problem is embarrassingly parallel), I thought this should obviously work. After I wrote the code to do this, however, I got very strange performance measurements:

Single-threaded, without jacobian: 20-30 ms per call
Single-threaded, with jacobian: 10-20 ms per call
Multi-threaded, without jacobian: 20-30 ms per call
Multi-threaded, with jacobian: 10-5000 ms per call

What's shocking is that what I thought should be the fastest setup, was actually the slowest, and the variability was two orders of magnitude. It's a deterministic computation; computers aren't supposed to work this way. What could possibly be causing this?

Effect seems system-dependent

I tried the same code on another computer and I didn't see this effect.

Both machines were using Ubuntu 64 bit, Python 2.7.6, scipy version 0.18.0, and numpy version 1.8.2. I didn't see the variability with an Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz processor. I did see the issue with an Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz.

Theories

One thought was that there might be a shared cache among processors, and by running it in parallel I can't fit two instances of the jacobian matrix in the cache, so they constantly battle each other for the cache slowing each other down compared with if they are run serially or without the jacobian. But it's not a million variable system. The jacobian is a 29x29 matrix, which takes up 6728 bytes. The level 1 cache on the processor is 4 x 32 KB, much larger. Are there any other shared resources between processors that might be to blame? How can we test this?

Another thing I noticed is that each python process seems to take several hundred percent of the CPU as it's running. This seems to mean that the code is already parallelized at some point (perhaps in the low-level library). This could mean that further parallelization wouldn't help, but I wouldn't expect such a dramatic slowdown.

Code

It would be good to try out the on more machines to see if (1) other people can experience the slowdown at all and (2) what are the common features of systems where the slowdown occurs. The code does 10 trials of two parallel computations using a multiprocessing pool of size two, printing out the time per scipy.ode.integrate call for each of the 10 trials.

'odeint with multiprocessing variable execution time demonsrtation'

from numpy import dot as npdot
from numpy import add as npadd
from numpy import matrix as npmatrix
from scipy.integrate import ode
from multiprocessing import Pool
import time

def main():
    "main function"

    pool = Pool(2) # try Pool(1)
    params = [0] * 2

    for trial in xrange(10):
        res = pool.map(run_one, params)
        print "{}. times: {}ms, {}ms".format(trial, int(1000 * res[0]), int(1000 * res[1]))

def run_one(_):
    "perform one simulation"

    final_time = 2.0
    init_state = [0.1 if d < 7 else 0.0 for d in xrange(29)]
    (a_matrix, b_vector) = get_dynamics()

    derivative = lambda dummy_t, state: npadd(npdot(a_matrix, state), b_vector)
    jacobian = lambda dummy_t, dummy_state: a_matrix
    #jacobian = None # try without the jacobian

    #print "jacobian bytes:", jacobian(0, 0).nbytes

    solver = ode(derivative, jacobian)
    solver.set_integrator('vode')
    solver.set_initial_value(init_state, 0)

    start = time.time()
    solver.integrate(final_time)
    dif = time.time() - start

    return dif

def get_dynamics():
    "return a tuple (A, b), which are the system dynamics x' = Ax + b"

    return \
    (
        npmatrix([
        [0, 0, 0, 0.99857378006, 0.053384274244, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [0, 0, 1, -0.003182219341, 0.059524655342, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [0, 0, -11.570495605469, -2.544637680054, -0.063602626324, 0.106780529022, -0.09491866827, 0.007107574493, -5.20817921341, -23.125876742495, -4.246931301528, -0.710743697134, -1.486697327603, -0.044548215175, 0.03436637817, 0.022990248611, 0.580153205353, 1.047552018229, 11.265023544535, 2.622275290571, 0.382949404795, 0.453076470454, 0.022651889536, 0.012533628369, 0.108399390974, -0.160139432044, -6.115359574845, -0.038972389136, 0, ],
        [0, 0, 0.439356565475, -1.998182296753, 0, 0.016651883721, 0.018462046981, -0.001187470742, -10.778778281386, 0.343052863546, -0.034949331535, -3.466737362551, 0.013415853489, -0.006501746896, -0.007248032248, -0.004835912875, -0.152495086764, 2.03915052839, -0.169614300211, -0.279125393264, -0.003678218266, -0.001679708185, 0.050812027754, 0.043273505033, -0.062305315646, 0.979162836629, 0.040401368402, 0.010697028656, 0, ],
        [0, 0, -2.040895462036, -0.458999156952, -0.73502779007, 0.019255757332, -0.00459562242, 0.002120360732, -1.06432932386, -3.659159530947, -0.493546966858, -0.059561101143, -1.953512259413, -0.010939065041, -0.000271004496, 0.050563886711, 1.58833954495, 0.219923768171, 1.821923233098, 2.69319056633, 0.068619628466, 0.086310028398, 0.002415425662, 0.000727041422, 0.640963888079, -0.023016712545, -1.069845542887, -0.596675149197, 0, ],
        [-32.103607177734, 0, -0.503355026245, 2.297859191895, 0, -0.021215811372, -0.02116791904, 0.01581159234, 12.45916782984, -0.353636907076, 0.064136531117, 4.035326800046, -0.272152744884, 0.000999589868, 0.002529691904, 0.111632959213, 2.736421830861, -2.354540136198, 0.175216915979, 0.86308171287, 0.004401276193, 0.004373406589, -0.059795009475, -0.051005479746, 0.609531777761, -1.1157829788, -0.026305051933, -0.033738880627, 0, ],
        [0.102161169052, 32.057830810547, -2.347217559814, -0.503611564636, 0.83494758606, 0.02122657001, -0.037879735231, 0.00035400386, -0.761479736492, -5.12933410588, -1.131382179292, -0.148788337148, 1.380741054924, -0.012931029503, 0.007645723855, 0.073796656681, 1.361745395486, 0.150700793731, 2.452437244444, -1.44883919298, 0.076516270282, 0.087122640348, 0.004623192159, 0.002635233443, -0.079401941141, -0.031023369979, -1.225533436977, 0.657926151362, 0, ],
        [-1.910972595215, 1.713829040527, -0.004005432129, -0.057411193848, 0, 0.013989634812, -0.000906753354, -0.290513515472, -2.060635522957, -0.774845915178, -0.471751979387, -1.213891560083, 5.030515136324, 0.126407660877, 0.113188603433, -2.078420624662, -50.18523312358, 0.340665548784, 0.375863242926, -10.641168797333, -0.003634153255, -0.047962774317, 0.030509705209, 0.027584169642, -10.542357589006, -0.126840767097, -0.391839285172, 0.420788121692, 0, ],
        [0.126296110212, -0.002898250629, -0.319316070797, 0.785201711657, 0.001772374259, 0.00000584372, 0.000005233812, -0.000097899495, -0.072611454126, 0.001666291957, 0.195701043078, 0.517339177294, 0.05236528267, -0.000003359731, -0.000003009077, 0.000056285381, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [-0.018114066432, 0.077615035084, 0.710897211118, 2.454275059389, -0.012792968774, 0.000040510624, 0.000036282541, -0.000678672106, 0.010414324729, -0.044623231468, 0.564308412696, -1.507321670112, 0.066879720068, -0.000023290783, -0.00002085993, 0.000390189123, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [-0.019957254425, 0.007108972111, 122.639137999354, 1.791704310155, 0.138329792976, 0.000000726169, 0.000000650379, -0.000012165459, -8.481152717711, -37.713895394132, -93.658221074435, -4.801972165378, -2.567389718833, 0.034138340146, -0.038880106034, 0.044603217363, 0.946016722396, 1.708172458034, 18.369114490772, 4.275967542224, 0.624449778826, 0.738801257357, 0.036936909247, 0.020437742859, 0.176759579388, -0.261128576436, -9.971904607075, -0.063549647738, 0, ],
        [0.007852964982, 0.003925745426, 0.287856349997, 58.053471054491, 0.030698062827, -0.000006837601, -0.000006123962, 0.000114549925, -17.580742026275, 0.55713614874, 0.205946900184, -43.230778067404, 0.004227082975, 0.006053854501, 0.006646690253, -0.009138926083, -0.248663457912, 3.325105302428, -0.276578605231, -0.455150962257, -0.005997822569, -0.002738986905, 0.082855748293, 0.070563187482, -0.101597078067, 1.596654829885, 0.065879787896, 0.017442923517, 0, ],
        [0.011497315687, -0.012583019909, 13.848373855148, 22.28881517216, 0.042287331657, 0.000197558695, 0.000176939544, -0.003309689199, -1.742140233901, -5.959510415282, -11.333020298294, -14.216479234895, -3.944800806497, 0.001304578929, -0.005139259078, 0.08647432259, 2.589998222025, 0.358614863147, 2.970887395829, 4.39160430183, 0.111893402319, 0.140739944934, 0.003938671797, 0.001185537435, 1.045176603318, -0.037531801533, -1.744525005833, -0.972957942438, 0, ],
        [-16.939142002537, 0.618053512295, 107.92089190414, 204.524147386814, 0.204407545189, 0.004742101706, 0.004247169746, -0.079444150933, -2.048456967261, -0.931989524708, -66.540858220883, -116.470289129818, -0.561301215495, -0.022312225275, -0.019484747345, 0.243518778973, 4.462098610572, -3.839389874682, 0.285714413078, 1.40736916669, 0.007176864388, 0.007131419303, -0.097503691021, -0.083171197416, 0.993922379938, -1.819432085819, -0.042893874898, -0.055015718216, 0, ],
        [-0.542809857455, 7.081822285872, -135.012404429101, 460.929268260027, 0.036498617908, 0.006937238413, 0.006213200589, -0.116219147061, -0.827454697348, 19.622217613195, 78.553728334274, -283.23862765888, 3.065444785639, -0.003847616297, -0.028984525722, 0.187507140282, 2.220506417769, 0.245737625222, 3.99902408961, -2.362524402134, 0.124769923797, 0.142065016461, 0.007538727793, 0.004297097528, -0.129475392736, -0.050587718062, -1.998394759416, 1.072835822585, 0, ],
        [-1.286456393795, 0.142279456389, -1.265748910581, 65.74306027738, -1.320702989799, -0.061855995532, -0.055400100872, 1.036269854556, -4.531489334771, 0.368539277612, 0.002487097952, -42.326462719738, 8.96223401238, 0.255676968878, 0.215513465742, -4.275436802385, -81.833676543035, 0.555500345288, 0.612894852362, -17.351836610113, -0.005925968725, -0.078209662789, 0.049750119549, 0.044979645917, -17.190711833803, -0.206830688253, -0.638945907467, 0.686150823668, 0, ],
        [0, 0, 0, 0, 0, -0.009702263896, -0.008689641059, 0.162541456323, 0, 0, 0, 0, 0, 0, 0, 0, -0.012, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [-8.153162937544, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.005, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [0, -3.261265175018, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.005, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [0, 0, 0, 0.17441246156, -3.261265175018, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [0, 0, -3.261265175018, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -8.5, -18, 0, 0, 0, 0, 0, 0, 0, ],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ],
        [0, 0, 0, -8.153162937544, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -8.5, -18, 0, 0, 0, 0, 0, ],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ],
        [0, 0, 0, 0, 0, 0, 0, 0, 0.699960862226, 0.262038222227, 0.159589891262, 0.41155156501, -1.701619176699, -0.0427567124, -0.038285155304, 0.703045934017, 16.975651534025, -0.115788018654, -0.127109026104, 3.599544290134, 0.001229743857, 0.016223661959, -0.01033400498, -0.00934235613, -6.433934989563, 0.042639567847, 0.132540852847, -0.142338323726, 0, ],
        [0, 0, 0, 0, 0, 0, 0, 0, -37.001496211974, 0.783588795613, -0.183854784348, -11.869599790688, -0.106084318011, -0.026306590251, -0.027118088888, 0.036744952758, 0.76460150301, 7.002366574508, -0.390318898363, -0.642631203146, -0.005701671024, 0.003522251111, 0.173867535377, 0.147911422248, 0.056092715216, -6.641979472328, 0.039602243105, 0.026181724138, 0, ],
        [0, 0, 0, 0, 0, 0, 0, 0, 1.991401999957, 13.760045912368, 2.53041689113, 0.082528789604, 0.728264862053, 0.023902766734, -0.022896554363, 0.015327568208, 0.370476566397, -0.412566245022, -6.70094564846, -1.327038338854, -0.227019235965, -0.267482033427, -0.008650986307, -0.003394359441, 0.098792645471, 0.197714179668, -6.369398456151, -0.011976840769, 0, ],
        [0, 0, 0, 0, 0, 0, 0, 0, 1.965859332057, -3.743127938662, -1.962645156793, 0.018929412474, 11.145046656101, -0.03600197464, -0.001222148117, 0.602488409354, 11.639787952728, -0.407672972316, 1.507740702165, -12.799953897143, 0.005393102236, -0.014208764492, -0.000915158115, -0.000640326416, -0.03653528842, 0.012458973237, -0.083125038259, -5.472831842357, 0, ],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ],
        ])
    , 
        npmatrix([1.0 if d == 28 else 0.0 for d in xrange(29)])
    )



if __name__ == "__main__":
    main()

Example Output

Here's an example of the output that demonstrates the problem (each run is slightly different). Notice the large variability in execution times (over two orders of magnitude!). Again, this all goes away if I either use a pool of size 1 (or run the code without a pool), or if I don't use an explicit jacobian in the call to integrate.

times: 5847ms, 5760ms

times: 4177ms, 3991ms

times: 229ms, 36ms

times: 1317ms, 1544ms

times: 87ms, 100ms

times: 113ms, 102ms

times: 4747ms, 5077ms

times: 597ms, 48ms

times: 9ms, 49ms

times: 135ms, 109ms

Daniel Wisehart · Accepted Answer

Based on the variability of the execution times you posted for the machine showing the problem, I wonder what else that computer is doing at the time you are running your test. Here are times I saw when I ran your code on an AWS r3.large server (2 cores, 15 GB of RAM) that normally runs interactive R sessions but is currently mostly idle:

times: 11ms, 11ms

times: 9ms, 9ms

times: 9ms, 9ms

times: 9ms, 9ms

times: 10ms, 10ms

times: 10ms, 10ms

times: 10ms, 10ms

times: 11ms, 10ms

times: 11ms, 10ms

times: 9ms, 9ms

Is it possible your machine is swapping and you do not know it? vmstat 5 will give you a lot of information about swap in and outs, but not about cache evictions.

Intel makes some very nice tools for monitoring--two at a time--thousands of different types of operations and errors going on in a processor--including L2 cache evictions--but they are a bit of a firehose: there is information generated every microsecond--or more frequently--and you have to decide what you are going to monitor and how often you want an interrupt to deliver the numbers into your software. Likely it will take many runs just to narrow down the stats you want to track and you still have to filter out the noise generated by the operating system and everything else running at the time. It is a time consuming process, but if you follow it to the end and run many different tests you will come to understand what is going on.

But is this--shared cache resources in a processor--really your question? It seems more like you just want to figure out why you have variable run times on one machine and, second, why multi-threaded is slower on both machines than single threaded. Do I have it right? If not I will edit my answer and we can talk about processor cache, cache snooping and cache coherency.

So, as to the variability on the i7-2670QM CPU machine, I would start with htop, vmstat 5 and iostat 5 to see if the machine is doing something you didn't realize. That much variability says the executable is getting stalled because the processor is busy doing something else: going off to the network and not finding a share it expects, unable to connect to a DNS server, getting kerbios failures: it could be a lot of things including hardware failures from a hard disk that is being continually reset. Oh, and move your program to /dev/shm and cd there before you start it. That won't help you if there are Python libraries in a bad place on a disk, but at least you won't have issues with your local directory. Report back what you find and we can make further suggestions.

Your second question as I see it, which is perhaps where you started, is why is your program slower when run multi-threaded than single-threaded. This is a big subject that will come a lot more in focus if we can see how you multi-threaded it. But even before we do you have to realize that there are several things that can cause a multi-threaded program to run slower than a single-threaded program, and it can have as much to do with the support infrastructure around your program--libraries and operating system calls you make--as your program. Just because you do not need mutexes does not mean the libraries and operating system do not need them when they are being called from a multi-threaded application. Locking a mutex is an expensive operation, especially as different threads are rotated between different cores.

On top of that, since the vode is not re-entrant, if you called it from multiple threads it is possible that it is having trouble finding convergence and having to recalculate the same values many times before it "gets lucky" and has enough processor time to complete an iteration before it is swapped out and intermediate results are overwritten. Give us the code you are using for your multi-threaded runs and I will add to this answer.

3 revs · Answer

This is intended as a formatted comment regarding the mathematical background raised in a comment by @Dietrich. As it doesn't address the programming question, I intend to delete this answer in a little while until the bounty blows over.

As @Dietrich noted, you can solve your ODE exactly, since if

x' = A*x,

then the exact solution is

x(t) = exp(A*t)*x0

Already I'd say that an exact solution is always superior than a numerical approximation, but this can indeed be faster than a numerical integration. As you noted in a comment, you're worried about efficiency. So don't compute the matrix exponential for each t: compute the eigensystem of A only once:

A*v_i = L_i*v_i

then

x(t) = sum_i c_i*v_i*exp(L_i*t),

and the coefficients c_i can be determined from the linear equations

x0 = sum_i c_i*v_i.

Now, having an inhomogeneous term doesn't change much, as long as your matrix is not singular:

x' = A*x + b
(x - A^(-1)*b)' = A*(x - A^(-1)*b)

so we can solve the homogeneous equation for y = x - A^(-1)*b and in a final step recover x = y + A^(-1)*b.

This all works nicely while the matrix is regular, but in your specific case it's singular. But it turns out that this is due to your final dimension:

>>> np.linalg.det(A)
0.0
>>> np.linalg.det(A[:-1,:-1])
1920987.0461154305

And also note that the final row of A is all zeros (this is the reason for the singularity of A). So the last dimension of x is constant (or changes linearly due to b).

I suggest eliminating this variable, rewriting your equation for the rest of the variables, and solving the non-singular inhomogeneous linear system of ODEs using the above procedure, exactly. It should be faster and precise.

The following will be a bit speculative, see also the caveat at the end.

In case of user-input A and b, things might get trickier. Finding a zero row/column in your matrix would be easy, but A can be singular even though none of its rows/columns are fully zero. I'm not an expert in the subject, but I think your best bet is using something akin to principal component analysis: transforming your system of equations according to the eigensystem of A. My following thoughts will still assume that A is diagonalizable, but mostly because I'm unfamiliar with singular value decomposition. In realistic cases I'd expect your matrices to be diagonalizable, even if singular.

So I'll assume that the matrix A can be decomposed as

A = V * D * V^(-1),

where D is a diagonal matrix containing the eigenvalues of A, and columns of V are the eigenvectors of A corresponding to each respective eigenvalue. The very same decomposition can be obtained in numpy using

DD,V = np.linalg.eig(A)
D = np.asmatrix(np.diag(DD))

I usually prefer using ndarrays instead of matrices, but this way V*D*np.linalg.inv(V) would really correspond to the matrix product of the three matrices, rather than calling np.dot twice.

Now, rewrite your equation again:

x' = A*x + b
x' = V*D*V^(-1)*x + b
V^(-1)*x' = D*V^(-1)*x + V^(-1)*b

By defining the auxiliary variables

X = V^(-1)*x
B = V^(-1)*b

we obtain

X' = D*X + B

i.e. the usual inhomogeneous form, but now D is a diagonal matrix containing the eigenvalues of A in the diagonal.

Since A is singular, some of the eigenvalues are zero. Look for zero elements in D (well, you can do that already with DD from eig()), and you'll know that they behave trivially during time-evolution. The remaining variables behave well, although at this point we see that the equations for X are decoupled due to D being diagonal, so you could integrate each independently and analytically. For this you need to first go from your initial condition x0 to X0 = np.linalg.inv(V)*x0, then after solving the equations, back to x = V*X.

Caveat: as I said, I'm not an expert in this subject. I can easily imagine that the inversions involved in the diagonalization can be a numerical issue in practical applications. So I'd first test if the matrix is singular, and only carry on with this procedure if it is (or nearly is). It's possible that the above carries a lot of error, in which case numerical integration might be better (I really can't tell).

Running scipy.integrate.ode in multiprocessing Pool results in huge performance hit

Tags:

python

numpy

scipy

python-multiprocessing

Effect seems system-dependent

Theories

Code

Example Output

Stanley Bak

2 Answers

Daniel Wisehart

3 revs

Recent Activity

Donate For Us

Running scipy.integrate.ode in multiprocessing Pool results in huge performance hit

Tags:

python

numpy

scipy

python-multiprocessing

Effect seems system-dependent

Theories

Code

Example Output

Stanley Bak

2 Answers

Daniel Wisehart

3 revs

Related questions

Recent Activity

Donate For Us