I would like to use matplotlib to generate a number of PDF files. My main problem is that matplotlib is slow, taking on the order of 0.5 seconds per file.
I tried to figure out why it takes so long, and I wrote the following test program that just plots a very simple curve as a PDF file:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
X = range(10)
Y = [ x**2 for x in X ]
for n in range(100):
    fig = plt.figure(figsize=(6,6))
    ax = fig.add_subplot(111)
    ax.plot(X, Y)
    fig.savefig("test.pdf")
But even something as simple as this takes a lot of time: 15–20 seconds in total for 100 PDF files (modern Intel platforms; I tried both Mac OS X and Linux systems).
Are there any tricks and techniques that I can use to speed up PDF generation in matplotlib? Obviously I can use multiple parallel threads on multi-core platforms, but is there anything else that I can do?
If it's practical, you could use multiprocessing to do this (assuming you have multiple cores on your machine):
NOTE: The following code will produce 40 PDFs in the current directory on your machine.
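import multiprocessing
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe in worker processes
import matplotlib.pyplot as plt

def do_plot(y_pos):
    fig = plt.figure()
    ax = plt.axes()
    ax.axhline(y_pos)
    fig.savefig('%s.pdf' % y_pos)
    plt.close(fig)  # release the figure so workers do not accumulate them

if __name__ == '__main__':  # required where workers are spawned, not forked
    pool = multiprocessing.Pool()  # defaults to one worker per CPU
    for i in range(40):
        pool.apply_async(do_plot, [i])
    pool.close()
    pool.join()
    print('done')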
It doesn't scale perfectly, but I get a significant boost by doing this on my 4 logical cores (dual-core with hyperthreading):
$> time python multi_pool_1.py
done
real 0m5.218s
user 0m4.901s
sys 0m0.205s
$> time python multi_pool_n.py
done
real 0m2.935s
user 0m9.022s
sys 0m0.420s
I'm sure there is a lot of scope for performance improvements on the pdf backend of mpl, but that is not on the timescale you are after.
Matplotlib has a lot of overhead for creating the figure, axes, and so on, even before saving to PDF. So if your plots are similar, you can save a lot of setup by reusing elements, just as the matplotlib animation examples do.
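To check how the time splits between setup and saving on your own machine, a rough timing sketch like the following may help. `time_one_pdf` is a hypothetical helper written for illustration, not part of matplotlib, and it writes to an in-memory buffer so that disk speed does not affect the numbers. Keep in mind that matplotlib does most of its drawing work lazily inside `savefig`, so the "save" figure includes rendering, not just file output.

```python
import io
import time

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the question
import matplotlib.pyplot as plt


def time_one_pdf(x, y):
    """Return (setup_seconds, save_seconds) for one simple line plot.

    Hypothetical helper for illustration only.
    """
    t0 = time.perf_counter()
    fig = plt.figure(figsize=(6, 6))
    ax = fig.add_subplot(111)
    ax.plot(x, y)
    t1 = time.perf_counter()

    buf = io.BytesIO()              # in-memory target instead of a file
    fig.savefig(buf, format="pdf")  # drawing happens here as well
    t2 = time.perf_counter()

    plt.close(fig)
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    xs = range(10)
    setup, save = time_one_pdf(xs, [v ** 2 for v in xs])
    print("setup: %.3fs, save: %.3fs" % (setup, save))
```

The first call also pays one-time costs such as font cache loading, so it is worth calling the function several times and ignoring the first result.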
You can reuse the figure and axes in this example:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
X = range(10)
Y = [ x**2 for x in X ]
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)
for n in range(100):
    # Clear the axes instead of creating a new figure. Removing only the
    # line with line.remove() is cheaper still, but may interfere with
    # autoscaling (see below).
    ax.clear()
    line = ax.plot(X, Y)[0]
    fig.savefig("test.pdf")
Note that this does not help that much. You can save quite a bit more by reusing the lines as well:
# fig, ax, X and Y are set up as in the previous example
line = ax.plot(X, Y)[0]
for n in range(100):
    # Now instead of plotting, we update the current line:
    line.set_xdata(X)
    line.set_ydata(Y)
    # If autoscaling is necessary:
    ax.relim()
    ax.autoscale()
    fig.savefig("test.pdf")
This is close to twice as fast as the initial example for me. This is only an option if you do similar plots, but if they are very similar, it can speed up things a lot. The matplotlib animation examples may have inspiration for this kind of optimization.
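In practice each file will hold different data and need its own name. A minimal sketch of the same reuse idea for that case follows; `save_series` and the `plot_NNN.pdf` naming scheme are assumptions made for illustration, and it assumes every plot is a single line on the same kind of axes.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the examples above
import matplotlib.pyplot as plt


def save_series(datasets, out_dir):
    """Write one PDF per (x, y) pair, reusing a single figure and line.

    Hypothetical helper for illustration; returns the paths written.
    """
    fig = plt.figure(figsize=(6, 6))
    ax = fig.add_subplot(111)
    (line,) = ax.plot([], [])  # created once, updated inside the loop
    paths = []
    for n, (x, y) in enumerate(datasets):
        line.set_data(x, y)    # replace the data instead of re-plotting
        ax.relim()             # recompute data limits from the line
        ax.autoscale_view()    # apply the new limits to the view
        path = os.path.join(out_dir, "plot_%03d.pdf" % n)
        fig.savefig(path)
        paths.append(path)
    plt.close(fig)
    return paths


if __name__ == "__main__":
    xs = range(10)
    data = [(xs, [v ** k for v in xs]) for k in (1, 2, 3)]
    print(save_series(data, "."))
```

Titles, labels, or other per-plot text would need to be updated in the loop in the same way, since nothing is cleared between saves.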