There seem to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables) -- I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of these most seamlessly integrate with "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5).
Most of it depends on your use case.
I have a lot more experience dealing with the various HDF5-based methods than with traditional relational databases, so I can't comment too much on SQLite libraries for Python...
At least as far as h5py vs pyTables goes, they both offer very seamless access via numpy arrays, but they're oriented towards very different use cases.
If you have n-dimensional data that you want to quickly access an arbitrary index-based slice of, then it's much simpler to use h5py. If you have data that's more table-like, and you want to query it, then pyTables is a much better option.
h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you're going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you're going to need to spend more time tweaking things.
pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.
To give a more concrete example, I work a lot with fairly large (tens of GB) 3- and 4-dimensional arrays of data. They're homogeneous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset. h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice are adjacent on disk.)
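As a rough sketch of that workflow (file name, shape, and slice indices are all made up for illustration), h5py lets you create a chunked dataset and pull an arbitrary sub-cube back out:

```python
import numpy as np
import h5py

# Write a (modestly sized, for demonstration) 3-D float array to HDF5.
# chunks=True asks h5py to auto-guess a reasonable chunk size.
with h5py.File("volume.h5", "w") as f:
    dset = f.create_dataset("data", shape=(100, 200, 300),
                            dtype="f4", chunks=True)
    dset[...] = np.random.random((100, 200, 300)).astype("f4")

# Read back an arbitrary index-based slice; only the chunks that
# intersect the requested region are actually read from disk.
with h5py.File("volume.h5", "r") as f:
    subset = f["data"][10:20, 50:60, 100:150]
    print(subset.shape)  # (10, 10, 50)
```

The indexing syntax is plain numpy-style slicing, which is what makes access to arbitrary regions so painless compared to managing offsets in a raw binary file yourself.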
As a counter example, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases. (Particularly in terms of disk usage and speed at which a large (index-based) chunk of data can be read into memory)
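A minimal sketch of that kind of pyTables workflow (the table layout, column names, and query condition are all invented for illustration, not her actual schema): define a table of timestamped sensor readings, then run an in-kernel query with `Table.where`, which evaluates the condition without loading the whole table into memory.

```python
import tables

# Hypothetical schema for timestamped sensor readings.
class Reading(tables.IsDescription):
    timestamp = tables.Float64Col()
    sensor_id = tables.Int32Col()
    value = tables.Float32Col()

# Write some synthetic readings.
with tables.open_file("sensors.h5", "w") as f:
    table = f.create_table("/", "readings", Reading)
    row = table.row
    for i in range(1000):
        row["timestamp"] = float(i)
        row["sensor_id"] = i % 10
        row["value"] = i * 0.1
        row.append()
    table.flush()

# In-kernel query: the condition string is compiled and evaluated
# chunk by chunk on disk, so only matching rows reach Python.
with tables.open_file("sensors.h5", "r") as f:
    table = f.root.readings
    hits = [r["value"] for r in
            table.where("(sensor_id == 3) & (value > 50)")]
    print(len(hits))
```

Adding an index on a queried column (`table.cols.sensor_id.create_index()`) speeds these queries up further, which is part of where the disk-usage and read-speed advantages over a traditional relational database come from.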