 

matplotlib: faster PDF generation?

I would like to use matplotlib to generate a number of PDF files. My main problem is that matplotlib is slow, taking on the order of 0.5 seconds per file.

I tried to figure out why it takes so long, and I wrote the following test program that just plots a very simple curve as a PDF file:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

X = range(10)
Y = [ x**2 for x in X ]

for n in range(100):
    fig = plt.figure(figsize=(6,6))
    ax = fig.add_subplot(111)
    ax.plot(X, Y)
    fig.savefig("test.pdf")

But even something as simple as this takes a lot of time: 15–20 seconds in total for 100 PDF files (modern Intel platforms; I tried both Mac OS X and Linux systems).

Are there any tricks and techniques that I can use to speed up PDF generation in matplotlib? Obviously I can use multiple parallel threads on multi-core platforms, but is there anything else that I can do?

Jukka Suomela asked Aug 19 '12 13:08



2 Answers

If it's practical, you could use multiprocessing to do this (assuming you have multiple cores on your machine):

NOTE: The following code will produce 40 PDFs in the current directory on your machine.

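import matplotlib
matplotlib.use('Agg')  # non-interactive backend, usable in worker processes
import matplotlib.pyplot as plt

import multiprocessing


def do_plot(y_pos):
    # Runs in a worker process and writes one PDF named after y_pos.
    fig = plt.figure()
    ax = plt.axes()
    ax.axhline(y_pos)
    fig.savefig('%s.pdf' % y_pos)
    plt.close(fig)  # release the figure so workers do not accumulate them

if __name__ == '__main__':  # required where workers are spawned, not forked
    pool = multiprocessing.Pool()  # one worker per CPU by default

    for i in range(40):  # range rather than xrange on Python 3
        pool.apply_async(do_plot, [i])

    pool.close()
    pool.join()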

It doesn't scale perfectly, but I get a significant boost from this on my 4 logical cores (dual-core with hyperthreading):

$> time python multi_pool_1.py 
done

real    0m5.218s
user    0m4.901s
sys     0m0.205s

$> time python multi_pool_n.py 
done

real    0m2.935s
user    0m9.022s
sys     0m0.420s

I'm sure there is a lot of scope for performance improvements in the PDF backend of matplotlib (mpl), but that is not on the timescale you are after.
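A comparison like the two timings above can also be run from a single script. The following is only a sketch under assumptions: `render_one`, `render_all`, the `(out_dir, y_pos)` job tuple and the `processes` argument are names made up here for illustration, and the two scripts timed above are presumed (not confirmed by the answer) to differ only in pool size.

```python
import multiprocessing
import os
import time

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, usable in worker processes
import matplotlib.pyplot as plt


def render_one(job):
    """Worker (hypothetical name): draw one horizontal line, save it as a PDF."""
    out_dir, y_pos = job
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.axhline(y_pos)
    path = os.path.join(out_dir, "%s.pdf" % y_pos)
    fig.savefig(path)
    plt.close(fig)  # release the figure inside the worker
    return path


def render_all(out_dir, positions, processes=None):
    """Render every position with a pool; return (paths, elapsed_seconds).

    processes=None lets multiprocessing pick one worker per CPU.
    """
    jobs = [(out_dir, y) for y in positions]
    start = time.perf_counter()
    with multiprocessing.Pool(processes) as pool:
        # map blocks until all jobs finish and re-raises worker exceptions
        paths = pool.map(render_one, jobs)
    return paths, time.perf_counter() - start
```

Called from under `if __name__ == '__main__':`, `render_all(d, range(40), processes=1)` and `render_all(d, range(40))` give the two numbers to compare for some output directory `d`. Using `pool.map` rather than `apply_async` is a deliberate choice in this sketch: an exception raised in a worker is re-raised in the parent, so a failing plot does not pass silently.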

HTH,

pelson answered Sep 28 '22 07:09



Matplotlib has a lot of overhead for creating the figure, axes, etc., even before saving to PDF. So if your plots are similar, you can save a lot of setup work by reusing elements, just as the matplotlib animation examples do.
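To see how the time splits between setup and saving on a given machine, the two phases can be timed separately. This is a rough sketch, not a rigorous benchmark; `time_setup_and_save` is a name invented here, and saving goes to an in-memory buffer so that no files are left behind.

```python
import io
import time

import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt


def time_setup_and_save(n=5):
    """Return (setup_seconds, save_seconds), each summed over n figures."""
    X = list(range(10))
    Y = [x**2 for x in X]
    setup = 0.0
    save = 0.0
    for _ in range(n):
        t0 = time.perf_counter()
        fig = plt.figure(figsize=(6, 6))  # figure + axes + line creation
        ax = fig.add_subplot(111)
        ax.plot(X, Y)
        t1 = time.perf_counter()
        fig.savefig(io.BytesIO(), format="pdf")  # PDF rendering only
        t2 = time.perf_counter()
        plt.close(fig)
        setup += t1 - t0
        save += t2 - t1
    return setup, save
```

Printing the result of `time_setup_and_save(20)` shows which phase dominates; the numbers depend on the matplotlib version and the machine.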

You can reuse the figure and axes in this example:

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

X = range(10)
Y = [ x**2 for x in X ]
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)


for n in range(100):
    ax.clear()  # or, even better, just line.remove(), but that
                # may interfere with autoscaling (see below)
    line = ax.plot(X, Y)[0]
    fig.savefig("test.pdf")

Note that this does not help that much. You can save quite a bit more by reusing the line itself:

line = ax.plot(X, Y)[0]
for n in range(100):
    # Now instead of plotting, we update the current line:
    line.set_xdata(X)
    line.set_ydata(Y)
    # If autoscaling is necessary:
    ax.relim()
    ax.autoscale()

    fig.savefig("test.pdf")

For me, this is close to twice as fast as the initial example. It is only an option if your plots are similar, but if they are very similar it can speed things up a lot. The matplotlib animation examples may provide inspiration for this kind of optimization.
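The same idea extends to the case where every file gets different data and its own file name. A minimal sketch, assuming the plots differ only in their data; `save_curves`, `datasets` and the `curve_NNN.pdf` naming are hypothetical choices made for this example:

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt


def save_curves(datasets, out_dir):
    """Write one PDF per (x, y) pair, reusing one figure, axes and line."""
    fig = plt.figure(figsize=(6, 6))
    ax = fig.add_subplot(111)
    (line,) = ax.plot([], [])  # create the line once, with no data yet
    paths = []
    for i, (x, y) in enumerate(datasets):
        line.set_data(x, y)   # update the existing line instead of re-plotting
        ax.relim()            # recompute data limits from the new data
        ax.autoscale_view()   # rescale the axes to those limits
        path = os.path.join(out_dir, "curve_%03d.pdf" % i)
        fig.savefig(path)
        paths.append(path)
    plt.close(fig)
    return paths
```

For instance, `save_curves([(X, [x**k for x in X]) for k in (1, 2, 3)], ".")` with `X = list(range(10))` writes three PDFs to the current directory. If the axis limits can stay fixed across files, the `relim` and `autoscale_view` calls can be dropped.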

seberg answered Sep 28 '22 07:09
