 

Data persistence of scientific simulation data, MongoDB + HDF5?

I'm developing a Monte Carlo simulation software package that involves multiple physics and simulators. I need to do online analysis, track the dependency of derived data on raw data, and perform queries like "give me the waveforms for temperature > 400 and position near (x0, y0)". So the in-memory data model is rather complicated.

The application is written in Python, with each simulation result modeled as a Python object. Every hour it produces ~100 results (objects). Most objects contain heavy data (several MB of binary numeric arrays) as well as some light data (temperature, position, etc.). The total data generation rate is several GB per hour.

I need a data persistence solution and an easy-to-use query API. I've already decided to store the heavy data (numeric arrays) in HDF5 storage(s). I'm considering using MongoDB for object persistence (light data only) and for indexing the heavy data in HDF5. Object persistence with MongoDB is straightforward, and the query interface looks sufficiently powerful.
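
For reference, this is a rough sketch of the split I have in mind, using h5py and pymongo (the database, collection and field names below are just placeholders):

    import h5py
    import numpy as np
    from pymongo import MongoClient

    results = MongoClient()["simdb"]["results"]   # placeholder database/collection

    def save_result(run_id, waveform, temperature, position):
        # Heavy data: one HDF5 dataset per result.
        h5_path = "results.h5"
        dataset_name = "waveform_%06d" % run_id
        with h5py.File(h5_path, "a") as f:
            f.create_dataset(dataset_name, data=waveform)

        # Light data: queryable metadata plus a pointer to the HDF5 dataset.
        results.insert_one({
            "run_id": run_id,
            "temperature": temperature,
            "position": position,                 # e.g. [x, y]
            "hdf5_file": h5_path,
            "hdf5_dataset": dataset_name,
        })

    save_result(1, np.random.rand(1000000), 415.0, [1.2, 3.4])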

I am aware of the SQLAlchemy + SQLite option. However, streaming the heavy data to HDF5 does not seem naturally supported in SQLAlchemy, and a fixed schema is cumbersome.

I am aware of this post (Searching a HDF5 dataset), but the "index table" itself needs some in-memory indices for fast queries.

I wonder if there are any alternative solutions I should look at before I jump in, or whether there is any problem I've overlooked in my plan?

TIA.

asked Jan 25 '12 by Shen Chen


1 Answer

Some things to know about Mongo that might be relevant to the situation you described, and why it might be a good fit:

I need to do online analysis, track the dependency of derived data on raw data, and perform queries like "give me the waveforms for temperature > 400 and position near (x0, y0)".

Mongo has a flexible query language that makes it very easy to do queries like this. Geospatial (2D) indexes are also supported - plus, if you need to query on position and temperature very frequently, you can create a compound index on (temperature, position) to keep that query fast.
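
For example, with pymongo the query you described could look roughly like this (collection and field names are only illustrative; for a compound index that includes a 2d field, the 2d field has to come first):

    from pymongo import MongoClient, ASCENDING

    results = MongoClient()["simdb"]["results"]   # illustrative names

    # 2d geospatial index on position, combined with temperature.
    results.create_index([("position", "2d"), ("temperature", ASCENDING)])

    # "give me the waveforms for temperature > 400 and position near (x0, y0)"
    x0, y0 = 1.0, 2.0
    cursor = results.find({
        "temperature": {"$gt": 400},
        "position": {"$near": [x0, y0], "$maxDistance": 0.5},
    })
    for doc in cursor:
        print(doc["run_id"], doc["hdf5_dataset"])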

Most objects contain heavy data (several MB of binary numeric arrays) as well as some light data (temperature, position, etc.).

Each document in MongoDB can hold up to 16MB of data, and a binary field type is also supported - so it would be relatively simple to embed a few megabytes of binary data in a field and retrieve the document by querying on its other fields. If you expect to need more than 16MB, you can also use MongoDB's GridFS API, which allows you to store arbitrarily large blobs of binary data and retrieve them quickly.
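
Roughly, with pymongo, bson and gridfs (the names here are only illustrative):

    import numpy as np
    import gridfs
    from bson.binary import Binary
    from pymongo import MongoClient

    db = MongoClient()["simdb"]                   # illustrative database name
    waveform = np.random.rand(100000)             # stand-in for a real result array

    # Under the 16MB document limit: embed the raw bytes directly in the document.
    db.results.insert_one({
        "run_id": 2,
        "temperature": 380.0,
        "waveform": Binary(waveform.tobytes()),
    })

    # Larger blobs: put them in GridFS and keep only the file id in the document.
    fs = gridfs.GridFS(db)
    file_id = fs.put(waveform.tobytes(), run_id=3)
    db.results.insert_one({"run_id": 3, "temperature": 420.0, "waveform_file": file_id})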

The total data generation rate is several GB per hour.

For a large, rapidly growing data set like this, you can create a sharded setup, which will allow you to add servers to accommodate the data no matter how large it gets.
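
Setting that up is mostly an operational task, but the gist is to connect to a mongos router and shard the collection (the address and shard key below are only examples):

    from pymongo import MongoClient

    # Assumes a sharded cluster is already running behind a mongos router.
    client = MongoClient("mongodb://mongos-host:27017")   # example address

    client.admin.command("enableSharding", "simdb")
    client.admin.command("shardCollection", "simdb.results", key={"run_id": 1})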

answered by mpobrien