I'd like to create 2D histograms in Python from large datasets (100,000+ samples) stored in an HDF5 file. I came up with the following code:
import sys
import h5py
import numpy as np
import matplotlib as mpl
import matplotlib.pylab
f = h5py.File(sys.argv[1], 'r')
A = f['A']
T = f['T']
at_hist, xedges, yedges = np.histogram2d(T, A, bins=500)
extent = [yedges[0], yedges[-1], xedges[0], xedges[-1]]
fig = mpl.pylab.figure()
at_plot = fig.add_subplot(111)
at_plot.imshow(at_hist, extent=extent, origin='lower', aspect='auto')
mpl.pylab.show()
f.close()
It takes about 15 s to execute (100,000 data points). CERN's ROOT, however (using its own tree data structure instead of HDF5), can do this in less than 1 s. Do you have any idea how I could speed up the code? I could also change the structure of the HDF5 data if that would help.
I would try a few different things.

1. Load your data into ordinary in-memory numpy arrays instead of working through h5py's on-disk, array-like objects.
2. If that doesn't fix the problem, exploit a scipy.sparse.coo_matrix to make the 2D histogram. With older versions of numpy, digitize (which all of the various histogram* functions use internally) could use excessive memory under some circumstances. That's no longer the case with recent (>1.5??) versions of numpy, though.

As an example of the first suggestion, you'd do something like:
f = h5py.File(sys.argv[1], 'r')
# Allocate in-memory arrays and read each dataset into them in one pass
A = np.empty(f['A'].shape, f['A'].dtype)
T = np.empty(f['T'].shape, f['T'].dtype)
f['A'].read_direct(A)
f['T'].read_direct(T)
The difference here is that the entirety of the arrays will be read into memory, instead of being h5py's array-like objects, which are basically efficient memory-mapped arrays stored on disk.
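If you prefer, a plain slice read does the same thing in one step; h5py returns an ordinary numpy array from it:

f = h5py.File(sys.argv[1], 'r')
# Slicing a dataset with [...] (or [:]) reads it fully into a numpy array,
# so later operations no longer go back to disk for every access.
A = f['A'][...]
T = f['T'][...]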
As for the second suggestion, don't bother with it unless the first one didn't solve your problem. It probably won't be significantly faster (and is likely slower for small arrays), and with recent versions of numpy, it's only slightly more memory-efficient. I do have a piece of code where I deliberately do this, but I wouldn't recommend it in general. It's a very hackish solution. In very select circumstances (many points and many bins), it can perform better than histogram2d, though.
All those caveats aside, though, here it is:
import numpy as np
import scipy.sparse
import timeit

def generate_data(num):
    x = np.random.random(num)
    y = np.random.random(num)
    return x, y

def crazy_histogram2d(x, y, bins=10):
    try:
        nx, ny = bins
    except TypeError:
        nx = ny = bins
    xmin, xmax = x.min(), x.max()
    ymin, ymax = y.min(), y.max()
    dx = (xmax - xmin) / (nx - 1.0)
    dy = (ymax - ymin) / (ny - 1.0)
    weights = np.ones(x.size)
    # Basically, this is just doing what np.digitize does with one less copy
    xyi = np.vstack((x, y)).T
    xyi -= [xmin, ymin]
    xyi /= [dx, dy]
    # Cast to integer bin indices (values are non-negative, so floor == trunc)
    xyi = np.floor(xyi, xyi).T.astype(np.intp)
    # Now, we'll exploit a sparse coo_matrix to build the 2D histogram...
    grid = scipy.sparse.coo_matrix((weights, xyi), shape=(nx, ny)).toarray()
    return grid, np.linspace(xmin, xmax, nx), np.linspace(ymin, ymax, ny)

if __name__ == '__main__':
    num = 10**6
    numruns = 10
    x, y = generate_data(num)
    t1 = timeit.timeit('crazy_histogram2d(x, y, bins=500)',
                       setup='from __main__ import crazy_histogram2d, x, y',
                       number=numruns)
    t2 = timeit.timeit('np.histogram2d(x, y, bins=500)',
                       setup='from __main__ import np, x, y',
                       number=numruns)
    print('Average of %i runs, using %.1e points' % (numruns, num))
    print('Crazy histogram', t1 / numruns, 'sec')
    print('numpy.histogram2d', t2 / numruns, 'sec')
On my system, this yields:
Average of 10 runs, using 1.0e+06 points
Crazy histogram 0.104092288017 sec
numpy.histogram2d 0.686891794205 sec
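As a quick sanity check, both routines place every sample in exactly one bin, so the total counts should match even though the bin edges are computed slightly differently (the grids won't agree element for element). A small sketch, reusing the functions above:

x, y = generate_data(100000)
grid, xcoords, ycoords = crazy_histogram2d(x, y, bins=500)
ref, xedges, yedges = np.histogram2d(x, y, bins=500)
# Both totals should equal the number of samples (100000 here)
print(grid.sum(), ref.sum())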
You need to figure out whether the bottleneck is in the data load or in histogram2d. Try inserting some time measurements into your code.

Are A and T arrays, or are they generator objects? If the latter, more care is needed to distinguish where the bottleneck is; you might have to unpack them to numpy arrays first to test.
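A minimal sketch of that measurement with time.perf_counter, assuming the imports and the f, A, T names from the question (the [...] slice forces a full read, so the load and the histogram are timed separately):

import time

t0 = time.perf_counter()
A = f['A'][...]   # force a full read into a numpy array
T = f['T'][...]
t1 = time.perf_counter()
hist, xedges, yedges = np.histogram2d(T, A, bins=500)
t2 = time.perf_counter()
print('load: %.3f s   histogram2d: %.3f s' % (t1 - t0, t2 - t1))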
Does the whole thing run in 15 s, or just the call to histogram2d? Importing the pylab family can take plenty of time. The numpy histogram2d function should be implemented in C if I recall correctly, so I doubt there are performance issues there. You can speed up Python by calling your script with the optimization flag -OO:
python -OO script.py
Also consider using Psyco to improve performance.
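For reference, Psyco is Python-2-only and long unmaintained; its classic usage was a two-liner at the top of the script (a sketch, assuming the psyco package is installed):

import psyco
psyco.full()  # JIT-compile every function in the process from here on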