Python particles simulator: out-of-core processing

Problem description

I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in Python/numpy. I need to save the simulation output (>>10GB) to a file and process the data in a second step. Compatibility with both Windows and Linux is important.

The number of particles (n_particles) is 10-100. The number of time-steps (time_size) is ~10^9.

The simulation has 3 steps (the code below is for an all-in-RAM version):

  1. Simulate (and store) an emission rate array (contains many almost-0 elements):

    • shape (n_particles x time_size), float32, size 80GB
  2. Compute the counts array (random values from a Poisson process with the previously computed rates):

    • shape (n_particles x time_size), uint8, size 20GB

      counts = np.random.poisson(lam=emission).astype(np.uint8)
      
  3. Find the timestamps (or indices) of the counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.

    # Loop across the particles
    timestamps = [np.nonzero(c) for c in counts]
    

I do step 1 once, then repeat steps 2-3 many (~100) times. In the future I may need to pre-process emission (apply cumsum or other functions) before computing counts.
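
For reference, here is a minimal sketch of the all-in-RAM workflow (toy sizes; the exponential draw is just a stand-in for my real emission simulation):

    import numpy as np

    n_particles = 10
    time_size = 10**6     # the real runs use ~10**9, which is what breaks the in-RAM version

    def simulate_emission(n_particles, time_size):
        """Stand-in for the real Brownian-motion/emission simulator."""
        return np.random.exponential(0.01, size=(n_particles, time_size)).astype(np.float32)

    # Step 1: done once
    emission = simulate_emission(n_particles, time_size)

    # Steps 2-3: repeated ~100 times (different random draws each time)
    for rep in range(100):
        # optional pre-processing would go here, e.g. np.cumsum(emission, axis=1)
        counts = np.random.poisson(lam=emission).astype(np.uint8)
        timestamps = [np.nonzero(c) for c in counts]   # sparse, fits in RAM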

Question

I have a working in-memory implementation and I'm trying to understand the best approach to implement an out-of-core version that can scale to (much) longer simulations.

What I would like to exist

I need to save arrays to a file, and I would like to use a single file per simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).

Ideally I would like a file-backed numpy array that I can preallocate and fill in chunks. Then, I would like the numpy array methods (max, cumsum, ...) to work transparently, requiring only a chunksize keyword to specify how much of the array to load at each iteration.
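
The closest thing I know of in plain numpy is np.memmap, which at least covers the preallocate-and-fill-in-chunks part (file name, sizes and fill values below are placeholders):

    import numpy as np

    n_particles, time_size, chunksize = 10, 10**6, 2**18   # toy sizes

    # Preallocate a file-backed float32 array (file name is arbitrary)
    emission = np.memmap('emission.dat', dtype=np.float32, mode='w+',
                         shape=(n_particles, time_size))

    # Fill it one time-chunk at a time
    for start in range(0, time_size, chunksize):
        stop = min(start + chunksize, time_size)
        # placeholder for the real per-chunk simulation
        emission[:, start:stop] = np.random.exponential(
            0.01, size=(n_particles, stop - start))

    emission.flush()   # make sure the data hits the disk

This covers preallocation and chunked writes, but methods like cumsum still produce full-size in-memory results, so the transparent chunked behaviour I'd like is missing.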

Even better, I would like a Numexpr-like tool that operates not between cache and RAM but between RAM and hard drive.

What are the practical options

As a first option I started experimenting with pyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover, my current solution (see below) is UGLY and not very efficient.

So the options for which I seek an answer are:

  1. implement a numpy array with the required functionality (how?)

  2. use pyTables in a smarter way (different data structures/methods)

  3. use another library: h5py, blaze, pandas... (I haven't tried any of them so far; a rough h5py sketch is included below).
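
To make option 3 concrete, here is a rough sketch of what I imagine the h5py route would look like (untried; dataset names, chunk sizes and parameter values are placeholders): one chunked HDF5 file holding the scalar parameters as attributes plus both big arrays, with the "counts >= 2" query done as a chunked scan.

    import h5py
    import numpy as np

    n_particles, time_size, chunksize = 10, 10**6, 2**18   # toy sizes

    with h5py.File('simulation.h5', 'w') as f:
        # scalar simulation parameters as attributes (values are placeholders)
        f.attrs['t_step'] = 5e-8
        f.attrs['n_particles'] = n_particles

        # preallocated, chunked datasets, all in one file
        em = f.create_dataset('emission', shape=(n_particles, time_size),
                              dtype='float32', chunks=(n_particles, chunksize))
        cnt = f.create_dataset('counts', shape=(n_particles, time_size),
                               dtype='u1', chunks=(n_particles, chunksize))

        # fill chunk by chunk (random data stands in for the real simulation)
        for start in range(0, time_size, chunksize):
            stop = min(start + chunksize, time_size)
            em_chunk = np.random.exponential(0.01, size=(n_particles, stop - start))
            em[:, start:stop] = em_chunk
            cnt[:, start:stop] = np.random.poisson(em_chunk).astype('u1')

        # "counts >= 2" as a chunked scan rather than a table query
        hits = [[] for _ in range(n_particles)]
        for start in range(0, time_size, chunksize):
            stop = min(start + chunksize, time_size)
            block = cnt[:, start:stop]                  # only this chunk is in RAM
            for ip in range(n_particles):
                hits[ip].append(start + np.nonzero(block[ip] >= 2)[0])
        hits = [np.concatenate(h) for h in hits]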

Tentative solution (pyTables)

I save the simulation parameters in the '/parameters' group: each parameter is converted to a numpy array scalar. It's a verbose solution, but it works.

I save emission as an extendable array (EArray), because I generate the data in chunks and I need to append each new chunk (I know the final size, though). Saving counts is more problematic: if I save it as a plain pytables array, it's difficult to perform queries like "counts >= 2". Therefore I saved counts as multiple tables (one per particle) [UGLY] and query them with .get_where_list('counts >= 2'). I'm not sure this is space-efficient, and generating all these tables instead of using a single array clutters the HDF5 file significantly. Moreover, strangely enough, creating those tables requires creating a custom dtype (even for standard numpy dtypes):

    # one table per particle (data_file, group and chunksize are defined earlier)
    dt = np.dtype([('counts', 'u1')])
    for ip in xrange(n_particles):
        name = "particle_%d" % ip
        data_file.create_table(
                    group, name, description=dt, chunkshape=chunksize,
                    expectedrows=time_size,
                    title='Binned timetrace of emitted ph (bin = t_step)'
                          ' - particle_%d' % ip)

Each particle-counts "table" has a different name (name = "particle_%d" % ip), and I need to put them in a python list for easy iteration.
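
For reference, here is a trimmed-down sketch of the parameters + emission (EArray) part of this approach (simplified toy sizes; the exponential draw stands in for the real simulated chunk):

    import numpy as np
    import tables

    n_particles, time_size, chunksize = 10, 10**6, 2**18   # toy sizes

    data_file = tables.open_file('simulation.hdf5', mode='w')

    # parameters: one numpy scalar per array in '/parameters'
    params = data_file.create_group('/', 'parameters')
    data_file.create_array(params, 't_step', obj=np.asarray(5e-8))
    data_file.create_array(params, 'n_particles', obj=np.asarray(n_particles))

    # emission: extendable along the time axis, appended chunk by chunk
    emission = data_file.create_earray(
        '/', 'emission', atom=tables.Float32Atom(),
        shape=(n_particles, 0), chunkshape=(n_particles, chunksize),
        expectedrows=time_size)

    for start in range(0, time_size, chunksize):
        stop = min(start + chunksize, time_size)
        chunk = np.random.exponential(0.01, size=(n_particles, stop - start))
        emission.append(chunk.astype(np.float32))

    data_file.close()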

EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.

asked Jan 05 '14 by user2304916

1 Answer

Dask.array can perform chunked operations like max, cumsum, etc., on an on-disk array such as a PyTables or h5py dataset.

import h5py
d = h5py.File('myfile.hdf5')['/data']      # on-disk HDF5 dataset

import dask.array as da
x = da.from_array(d, chunks=(1000, 1000))  # lazy, chunked view of the dataset

x looks and feels like a numpy array and copies much of its API. Operations on x will create a DAG of in-memory operations which will execute efficiently, using multiple cores and streaming from disk as necessary:

da.exp(x).mean(axis=0).compute()

http://dask.pydata.org/en/latest/
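
For the workflow in the question (Poisson counts drawn from on-disk rates, then indices of the rare non-zero bins), a rough sketch might look like this, assuming the rates sit in an HDF5 dataset named '/emission'; map_blocks applies the Poisson draw chunk by chunk and da.nonzero keeps only the small index arrays in memory:

    import h5py
    import numpy as np
    import dask.array as da

    f = h5py.File('simulation.h5', 'r')
    emission = da.from_array(f['/emission'], chunks=(10, 2**20))

    # draw Poisson counts chunk by chunk, keeping them as uint8
    counts = emission.map_blocks(
        lambda block: np.random.poisson(block).astype(np.uint8),
        dtype=np.uint8)

    # indices of non-zero bins, one small array per particle
    timestamps = [da.nonzero(counts[ip])[0].compute()
                  for ip in range(counts.shape[0])]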

    conda install dask

or

    pip install dask
answered Oct 18 '22 by MRocklin