I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in Python/NumPy. I need to save the simulation output (>>10 GB) to a file and then process the data in a second step. Compatibility with both Windows and Linux is important.
The number of particles (`n_particles`) is 10-100. The number of time steps (`time_size`) is ~10^9.
The simulation has 3 steps (the code below is for an all-in-RAM version):
1. Simulate (and store) an `emission` rate array (containing many almost-0 elements): shape (`n_particles` x `time_size`), float32, size 80 GB.

2. Compute a `counts` array (random values from a Poisson process with the previously computed rates): shape (`n_particles` x `time_size`), uint8, size 20 GB.

       counts = np.random.poisson(lam=emission).astype(np.uint8)

3. Find the timestamps (or indices) of the counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.

       # Loop across the particles
       timestamps = [np.nonzero(c) for c in counts]
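For reference, a minimal all-in-RAM skeleton of the three steps. The `emission` values below are just a random placeholder (the real simulation derives them from the Brownian-motion trajectories), and the sizes are shrunk so the snippet actually runs:

    import numpy as np

    n_particles = 10
    time_size = 10**6   # illustrative; the real value is ~10**9

    # Step 1: emission rates -- placeholder values, NOT the real physics
    emission = np.random.exponential(0.01, size=(n_particles, time_size)).astype(np.float32)

    # Step 2: Poisson counts drawn from the rates
    counts = np.random.poisson(lam=emission).astype(np.uint8)

    # Step 3: indices (timestamps) of the non-zero counts, one array per particle
    timestamps = [np.nonzero(c)[0] for c in counts]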
I do step 1 once, then repeat steps 2-3 many (~100) times. In the future I may need to pre-process `emission` (apply `cumsum` or other functions) before computing `counts`.
I have a working in-memory implementation and I'm trying to understand the best approach for an out-of-core version that can scale to (much) longer simulations.
I need to save arrays to a file, and I would like to use a single file per simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).
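One simple option for the parameters (a sketch, not necessarily the best choice) is to keep the arrays and the parameter scalars in the same HDF5 file and store the scalars as attributes, e.g. with h5py. The file name and parameter names below are just placeholders:

    import h5py

    params = {'n_particles': 20, 'time_size': int(1e9), 't_step': 5e-6}  # hypothetical values

    with h5py.File('simulation.hdf5', 'w') as f:
        # scalars map naturally onto HDF5 attributes of the root group
        for key, value in params.items():
            f.attrs[key] = value

    # ...and read them back later as a plain dict
    with h5py.File('simulation.hdf5', 'r') as f:
        params_back = dict(f.attrs)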
Ideally I would like a file-backed numpy array that I can preallocate and fill in chunks. Then I would like the numpy array methods (`max`, `cumsum`, ...) to work transparently, requiring only a `chunksize` keyword to specify how much of the array to load at each iteration.
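The closest thing in plain numpy to a preallocated, file-backed array filled in chunks is `np.memmap`; a minimal sketch, where the file name, the (shrunk) sizes and the `compute_emission_chunk` helper are all hypothetical:

    import numpy as np

    n_particles, time_size, chunksize = 20, 10**7, 2**20   # small illustrative sizes

    def compute_emission_chunk(start, stop):
        # placeholder: the real simulator would return rates for this time window
        return np.zeros((n_particles, stop - start), dtype=np.float32)

    # preallocate the on-disk array
    emission = np.memmap('emission.dat', dtype=np.float32, mode='w+',
                         shape=(n_particles, time_size))

    # fill it chunk by chunk along the time axis
    for start in range(0, time_size, chunksize):
        stop = min(start + chunksize, time_size)
        emission[:, start:stop] = compute_emission_chunk(start, stop)

    emission.flush()

Note, however, that reductions like `max` or `cumsum` on a memmap still run through numpy's normal in-memory machinery, so this does not by itself give the chunked, `chunksize`-controlled behaviour described above; you would still have to loop over slices yourself.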
Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and hard drive.
As a first option I started experimenting with PyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover, my current solution (read below) is UGLY and not very efficient.
So my options, for which I seek an answer, are:

1. implement a numpy array with the required functionality (how?)
2. use PyTables in a smarter way (different data structures/methods)
3. use another library: h5py, blaze, pandas... (I haven't tried any of them so far)
I save the simulation parameters in a '/parameters' group: each parameter is converted to a numpy array scalar. This is a verbose solution, but it works.
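In code, that '/parameters' group looks roughly like the following sketch (file and parameter names are illustrative):

    import numpy as np
    import tables

    params = {'n_particles': 20, 't_step': 5e-6}  # hypothetical values

    data_file = tables.open_file('simulation.hdf5', mode='w')
    group = data_file.create_group('/', 'parameters')
    for key, value in params.items():
        # each scalar is stored as a 0-d numpy array in its own node
        data_file.create_array(group, key, obj=np.asarray(value))
    data_file.close()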
I save `emission` as an extendable array (`EArray`), because I generate the data in chunks and need to append each new chunk (I do know the final size, though). Saving `counts` is more problematic. If I save it as a PyTables array, it's difficult to perform queries like "counts >= 2". Therefore I saved `counts` as multiple tables (one per particle) [UGLY], and I query them with `.get_where_list('counts >= 2')`. I'm not sure this is space-efficient, and generating all these tables instead of using a single array clutters the HDF5 file significantly. Moreover, strangely enough, creating those tables requires defining a custom dtype (even for standard numpy dtypes):
    # Custom dtype wrapping a single uint8 field (needed even for standard numpy dtypes)
    dt = np.dtype([('counts', 'u1')])

    # One table per particle, each with a distinct name
    for ip in xrange(n_particles):
        name = "particle_%d" % ip
        data_file.create_table(
            group, name, description=dt, chunkshape=chunksize,
            expectedrows=time_size,
            title='Binned timetrace of emitted ph (bin = t_step)'
                  ' - particle_%d' % ip)
Each particle-counts "table" has a different name (`name = "particle_%d" % ip`) and I need to put them in a Python list for easy iteration.
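For completeness, the `emission` EArray part (create once, then append chunk by chunk along the time axis) looks roughly like this sketch; sizes are shrunk and the appended data is a zero placeholder:

    import numpy as np
    import tables

    n_particles, time_size, chunksize = 20, 2**22, 2**18   # illustrative values

    data_file = tables.open_file('simulation.hdf5', mode='w')
    emission_arr = data_file.create_earray(
        '/', 'emission', atom=tables.Float32Atom(),
        shape=(n_particles, 0),              # 0 marks the extendable (time) axis
        chunkshape=(n_particles, chunksize),
        expectedrows=time_size,
        title='Emission rates, one row per particle')

    # append the data chunk by chunk along the time axis
    for i in range(time_size // chunksize):
        chunk = np.zeros((n_particles, chunksize), dtype=np.float32)  # placeholder data
        emission_arr.append(chunk)

    data_file.close()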
EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.
Dask.array can perform chunked operations like `max`, `cumsum`, etc. on an on-disk array such as a PyTables or h5py dataset.
    import h5py
    d = h5py.File('myfile.hdf5')['/data']

    import dask.array as da
    x = da.from_array(d, chunks=(1000, 1000))

`x` looks and feels like a numpy array and copies much of its API. Operations on `x` build a DAG of in-memory operations which will execute efficiently using multiple cores, streaming from disk as necessary:

    da.exp(x).mean(axis=0).compute()
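If a derived array should go back to disk instead of into memory, dask.array can also stream the result into an HDF5 dataset; a short sketch, where the output file and dataset names are just placeholders:

    # compute a chunked expression and write the result to a new dataset
    y = da.exp(x).cumsum(axis=1)
    da.to_hdf5('results.hdf5', '/exp_cumsum', y)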
Documentation: http://dask.pydata.org/en/latest/

Install with `conda install dask` or `pip install dask`.