Efficiently create 2d histograms from large datasets

I'd like to create 2D histograms in Python from large datasets (100,000+ samples) stored in an HDF5 file. I came up with the following code:

import sys
import h5py
import numpy as np
import matplotlib as mpl
import matplotlib.pylab

f = h5py.File(sys.argv[1], 'r')

A = f['A']
T = f['T']

at_hist, xedges, yedges = np.histogram2d(T, A, bins=500)
extent = [yedges[0], yedges[-1], xedges[0], xedges[-1]]

fig = mpl.pylab.figure()
at_plot = fig.add_subplot(111)

at_plot.imshow(at_hist, extent=extent, origin='lower', aspect='auto')

mpl.pylab.show()

f.close()

It takes about 15 s to execute (100,000 data points). CERN's ROOT, however (using its own tree data structure instead of HDF5), can do this in under 1 s. Do you have any idea how I could speed up the code? I could also change the structure of the HDF5 data if that would help.

Asked by AbuBakr, Jan 10 '12

3 Answers

I would try a few different things.

  1. Load your data from the HDF5 file into memory instead of passing what are effectively memory-mapped arrays into histogram2d.
  2. If that doesn't fix the problem, you can exploit a scipy.sparse.coo_matrix to build the 2D histogram. With older versions of numpy, digitize (which all of the various histogram* functions use internally) could use excessive memory under some circumstances. That's no longer the case with recent (>1.5?) versions of numpy, though.

As an example of the first suggestion, you'd do something like:

f = h5py.File(sys.argv[1], 'r')
# Allocate in-memory arrays matching each dataset's shape and dtype,
# then read the datasets into them in a single pass.
A = np.empty(f['A'].shape, f['A'].dtype)
T = np.empty(f['T'].shape, f['T'].dtype)
f['A'].read_direct(A)
f['T'].read_direct(T)

The difference is that the arrays are read entirely into memory, rather than remaining h5py dataset objects, which behave like memory-mapped arrays backed by the file on disk.
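Equivalently, a minimal sketch using h5py's slicing syntax, which also pulls each full dataset into an in-memory numpy array:

import sys
import h5py
import numpy as np

f = h5py.File(sys.argv[1], 'r')
# Indexing a dataset with an ellipsis (or [:]) reads the whole
# thing into an ordinary numpy array in one call.
A = f['A'][...]
T = f['T'][...]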

As for the second suggestion: don't try it unless the first one didn't solve your problem.

It probably won't be significantly faster (and is likely slower for small arrays), and with recent versions of numpy it's only slightly more memory-efficient. I do have a piece of code where I deliberately do this, but I wouldn't recommend it in general; it's a very hackish solution. In very select circumstances (many points and many bins), though, it can perform better than histogram2d.

All those caveats aside, though, here it is:

import numpy as np
import scipy.sparse
import timeit

def generate_data(num):
    x = np.random.random(num)
    y = np.random.random(num)
    return x, y

def crazy_histogram2d(x, y, bins=10):
    try:
        nx, ny = bins
    except TypeError:
        nx = ny = bins
    xmin, xmax = x.min(), x.max()
    ymin, ymax = y.min(), y.max()
    dx = (xmax - xmin) / (nx - 1.0)
    dy = (ymax - ymin) / (ny - 1.0)

    weights = np.ones(x.size)

    # Basically, this is just doing what np.digitize does with one less copy
    xyi = np.vstack((x, y)).T
    xyi -= [xmin, ymin]
    xyi /= [dx, dy]
    # Floor to bin indices and cast to integers; coo_matrix
    # requires integer row/column arrays.
    xyi = np.floor(xyi, xyi).T.astype(np.intp)

    # Now, we'll exploit a sparse coo_matrix to build the 2D histogram:
    # coo_matrix sums the weights of duplicate (row, col) entries,
    # which is exactly what a histogram does.
    grid = scipy.sparse.coo_matrix((weights, xyi), shape=(nx, ny)).toarray()

    return grid, np.linspace(xmin, xmax, nx), np.linspace(ymin, ymax, ny)

if __name__ == '__main__':
    num = 1000000
    numruns = 10
    x, y = generate_data(num)
    t1 = timeit.timeit('crazy_histogram2d(x, y, bins=500)',
            setup='from __main__ import crazy_histogram2d, x, y',
            number=numruns)
    t2 = timeit.timeit('np.histogram2d(x, y, bins=500)',
            setup='from __main__ import np, x, y',
            number=numruns)
    print('Average of %i runs, using %.1e points' % (numruns, num))
    print('Crazy histogram', t1 / numruns, 'sec')
    print('numpy.histogram2d', t2 / numruns, 'sec')

On my system, this yields:

Average of 10 runs, using 1.0e+06 points
Crazy histogram 0.104092288017 sec
numpy.histogram2d 0.686891794205 sec
Answered by Joe Kington

You need to figure out whether the bottleneck is in the data load or in histogram2d. Try inserting some timing measurements into your code.

Are A and T numpy arrays, or are they lazy, generator-like objects? If the latter, more care is needed to pin down where the bottleneck is; you may have to unpack them into numpy arrays first to test.
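For instance, a minimal timing sketch (assuming the variable names from the question; time.perf_counter is just a wall-clock timer):

import sys
import time
import h5py
import numpy as np

f = h5py.File(sys.argv[1], 'r')

t0 = time.perf_counter()
# Force the datasets into plain in-memory arrays so the load is
# measured separately from the histogramming.
A = f['A'][...]
T = f['T'][...]
t1 = time.perf_counter()
hist, xedges, yedges = np.histogram2d(T, A, bins=500)
t2 = time.perf_counter()

print('data load:   %.3f s' % (t1 - t0))
print('histogram2d: %.3f s' % (t2 - t1))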

Answered by Sideshow Bob


Does the whole thing run in 15 s, or just the call to histogram2d? Importing the pylab family can take plenty of time. The numpy histogram2d function should be implemented in C if I recall correctly, so I doubt there are performance issues there. You can speed up Python by calling your script with the optimization flag -OO:

python -OO script.py 

Also consider using Psyco to improve performance.
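To check whether the import overhead is what dominates the runtime, a quick sketch that just times the import itself:

import time

t0 = time.perf_counter()
import matplotlib.pylab  # often the slowest import in a plotting script
print('importing matplotlib.pylab took %.2f s' % (time.perf_counter() - t0))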

Answered by edvaldig