I'm writing a Monte Carlo particle simulator (Brownian motion and photon emission) in Python/NumPy. I need to save the simulation output (>>10 GB) to a file and then process the data in a second step. Compatibility with both Windows and Linux is important.
The number of particles (`n_particles`) is 10-100. The number of time steps (`time_size`) is ~10^9.
The simulation has 3 steps (the code below is for an all-in-RAM version):
1. Simulate (and store) an `emission` rate array (containing many almost-0 elements): shape (`n_particles` x `time_size`), float32, size 80 GB.

2. Compute a `counts` array (random values from a Poisson process with the previously computed rates): shape (`n_particles` x `time_size`), uint8, size 20 GB.

       counts = np.random.poisson(lam=emission).astype(np.uint8)

3. Find the timestamps (or indices) of the counts. Counts are almost always 0, so the timestamp arrays will fit in RAM.

       # Loop across the particles
       timestamps = [np.nonzero(c) for c in counts]
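For reference, a minimal all-in-RAM skeleton of the three steps. The `emission` values below are just a random placeholder (the real simulation derives them from the Brownian-motion trajectories), and the sizes are shrunk so the snippet actually runs:

    import numpy as np

    n_particles = 10
    time_size = 10**6   # illustrative; the real value is ~10**9

    # Step 1: emission rates -- placeholder values, NOT the real physics
    emission = np.random.exponential(0.01, size=(n_particles, time_size)).astype(np.float32)

    # Step 2: Poisson counts drawn from the rates
    counts = np.random.poisson(lam=emission).astype(np.uint8)

    # Step 3: indices (timestamps) of the non-zero counts, one array per particle
    timestamps = [np.nonzero(c)[0] for c in counts]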
I do step 1 once, then repeat steps 2-3 many (~100) times. In the future I may need to pre-process `emission` (apply `cumsum` or other functions) before computing `counts`.
I have a working in-memory implementation and I'm trying to understand the best approach for an out-of-core version that can scale to (much) longer simulations.
I need to save arrays to a file, and I would like to use a single file per simulation. I also need a "simple" way to store and recall a dictionary of simulation parameters (scalars).
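One simple option for the parameters (a sketch, not necessarily the best choice) is to keep the arrays and the parameter scalars in the same HDF5 file and store the scalars as attributes, e.g. with h5py. The file name and parameter names below are just placeholders:

    import h5py

    params = {'n_particles': 20, 'time_size': int(1e9), 't_step': 5e-6}  # hypothetical values

    with h5py.File('simulation.hdf5', 'w') as f:
        # scalars map naturally onto HDF5 attributes of the root group
        for key, value in params.items():
            f.attrs[key] = value

    # ...and read them back later as a plain dict
    with h5py.File('simulation.hdf5', 'r') as f:
        params_back = dict(f.attrs)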
Ideally I would like a file-backed numpy array that I can preallocate and fill in chunks. Then I would like the numpy array methods (`max`, `cumsum`, ...) to work transparently, requiring only a `chunksize` keyword to specify how much of the array to load at each iteration.
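The closest thing in plain numpy to a preallocated, file-backed array filled in chunks is `np.memmap`; a minimal sketch, where the file name, the (shrunk) sizes and the `compute_emission_chunk` helper are all hypothetical:

    import numpy as np

    n_particles, time_size, chunksize = 20, 10**7, 2**20   # small illustrative sizes

    def compute_emission_chunk(start, stop):
        # placeholder: the real simulator would return rates for this time window
        return np.zeros((n_particles, stop - start), dtype=np.float32)

    # preallocate the on-disk array
    emission = np.memmap('emission.dat', dtype=np.float32, mode='w+',
                         shape=(n_particles, time_size))

    # fill it chunk by chunk along the time axis
    for start in range(0, time_size, chunksize):
        stop = min(start + chunksize, time_size)
        emission[:, start:stop] = compute_emission_chunk(start, stop)

    emission.flush()

Note, however, that reductions like `max` or `cumsum` on a memmap still run through numpy's normal in-memory machinery, so this does not by itself give the chunked, `chunksize`-controlled behaviour described above; you would still have to loop over slices yourself.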
Even better, I would like a Numexpr that operates not between cache and RAM but between RAM and hard drive.
As a first option I started experimenting with PyTables, but I'm not happy with its complexity and abstractions (so different from numpy). Moreover, my current solution (read below) is UGLY and not very efficient.
So my options, for which I seek an answer, are:

1. implement a numpy array with the required functionality (how?)
2. use PyTables in a smarter way (different data structures/methods)
3. use another library: h5py, blaze, pandas... (I haven't tried any of them so far)
I save the simulation parameters in a '/parameters' group: each parameter is converted to a numpy array scalar. This is a verbose solution, but it works.
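In code, that '/parameters' group looks roughly like the following sketch (file and parameter names are illustrative):

    import numpy as np
    import tables

    params = {'n_particles': 20, 't_step': 5e-6}  # hypothetical values

    data_file = tables.open_file('simulation.hdf5', mode='w')
    group = data_file.create_group('/', 'parameters')
    for key, value in params.items():
        # each scalar is stored as a 0-d numpy array in its own node
        data_file.create_array(group, key, obj=np.asarray(value))
    data_file.close()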
I save `emission` as an extendable array (`EArray`), because I generate the data in chunks and need to append each new chunk (I do know the final size, though). Saving `counts` is more problematic. If I save it as a PyTables array, it's difficult to perform queries like "counts >= 2". Therefore I saved `counts` as multiple tables (one per particle) [UGLY], and I query them with `.get_where_list('counts >= 2')`. I'm not sure this is space-efficient, and generating all these tables instead of using a single array clutters the HDF5 file significantly. Moreover, strangely enough, creating those tables requires defining a custom dtype (even for standard numpy dtypes):
    # Custom dtype wrapping a single uint8 field (needed even for standard numpy dtypes)
    dt = np.dtype([('counts', 'u1')])

    # One table per particle, each with a distinct name
    for ip in xrange(n_particles):
        name = "particle_%d" % ip
        data_file.create_table(
            group, name, description=dt, chunkshape=chunksize,
            expectedrows=time_size,
            title='Binned timetrace of emitted ph (bin = t_step)'
                  ' - particle_%d' % ip)
Each particle-counts "table" has a different name (`name = "particle_%d" % ip`) and I need to put them in a Python list for easy iteration.
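For completeness, the `emission` EArray part (create once, then append chunk by chunk along the time axis) looks roughly like this sketch; sizes are shrunk and the appended data is a zero placeholder:

    import numpy as np
    import tables

    n_particles, time_size, chunksize = 20, 2**22, 2**18   # illustrative values

    data_file = tables.open_file('simulation.hdf5', mode='w')
    emission_arr = data_file.create_earray(
        '/', 'emission', atom=tables.Float32Atom(),
        shape=(n_particles, 0),              # 0 marks the extendable (time) axis
        chunkshape=(n_particles, chunksize),
        expectedrows=time_size,
        title='Emission rates, one row per particle')

    # append the data chunk by chunk along the time axis
    for i in range(time_size // chunksize):
        chunk = np.zeros((n_particles, chunksize), dtype=np.float32)  # placeholder data
        emission_arr.append(chunk)

    data_file.close()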
EDIT: The result of this question is a Brownian Motion simulator called PyBroMo.
Dask.array can perform chunked operations like `max`, `cumsum`, etc. on an on-disk array such as a PyTables or h5py dataset.
    import h5py
    d = h5py.File('myfile.hdf5')['/data']

    import dask.array as da
    x = da.from_array(d, chunks=(1000, 1000))

`x` looks and feels like a numpy array and copies much of its API. Operations on `x` build a DAG of in-memory operations which will execute efficiently using multiple cores, streaming from disk as necessary:

    da.exp(x).mean(axis=0).compute()
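If a derived array should go back to disk instead of into memory, dask.array can also stream the result into an HDF5 dataset; a short sketch, where the output file and dataset names are just placeholders:

    # compute a chunked expression and write the result to a new dataset
    y = da.exp(x).cumsum(axis=1)
    da.to_hdf5('results.hdf5', '/exp_cumsum', y)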
Documentation: http://dask.pydata.org/en/latest/

Install with `conda install dask` or `pip install dask`.