Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

exporting from/importing to numpy, scipy in SQLite and HDF5 formats

There seems to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables) -- I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of these most seamlessly integrate with "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5).

like image 868
hatmatrix Avatar asked Oct 25 '11 01:10

hatmatrix


1 Answers

Most of it depends on your use case.

I have a lot more experience dealing with the various HDF5-based methods than traditional relational databases, so I can't comment too much on SQLite libraries for python...

At least as far as h5py vs pyTables, they both offer very seamless access via numpy arrays, but they're oriented towards very different use cases.

If you have n-dimensional data that you want to quickly access an arbitrary index-based slice of, then it's much more simple to use h5py. If you have data that's more table-like, and you want to query it, then pyTables is a much better option.

h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you're going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you're going to need to spend more time tweaking things.

pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.

To give a more concrete example, I work a lot with fairly large (tens of GB) 3 and 4 dimensional arrays of data. They're homogenous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset. h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice are adjacent on disk.)

As a counter example, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary querys (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases. (Particularly in terms of disk usage and speed at which a large (index-based) chunk of data can be read into memory)

like image 115
Joe Kington Avatar answered Oct 08 '22 02:10

Joe Kington