
Google Protocol Buffers, HDF5, NumPy comparison (transferring data)

Tags: python, hdf5, numpy

I need help making a decision. I need to transfer some data in my application and have to choose between these three technologies. I've read a little about each of them (tutorials, documentation) but still can't decide...

How do they compare?

I need metadata support (the ability to receive a file and read it without any additional information or files) and fast read/write operations. The ability to store dynamic data (such as Python objects) would be a plus.

Things I already know:

  • NumPy is pretty fast but can't store dynamic data (like Python objects). (What about metadata?)
  • HDF5 is very fast, supports custom attributes, is easy to use, but can't store Python objects. Also HDF5 serializes NumPy data natively, so, IMHO, NumPy has no advantages over HDF5
  • Google Protocol Buffers also support self-describing messages and are pretty fast (though Python support is currently poor: slow and buggy). They CAN store dynamic data. Downsides: self-describing messages don't work from Python, and messages of 1 MB or more serialize and deserialize slowly.

PS: the data I need to transfer is the output of NumPy/SciPy computations (arrays, arrays of complicated structs, etc.)

UPD: cross-language access is required (C/C++/Python)

illegal-immigrant asked Nov 08 '10 16:11




2 Answers

There does seem to be a slight contradiction in your question - you want to be able to store Python objects, but you also want C/C++ access. I think that regardless of which choice you go with, you will need to convert your fancy Python data structures into more static structures such as arrays.
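
As a rough illustration of that conversion, here is a minimal sketch that flattens a list of Python dicts into a NumPy structured array with a fixed memory layout. The record contents, field names and field widths are hypothetical, chosen only for the example; the point is that the resulting layout is one a C struct could mirror.

```python
import numpy as np

# Hypothetical "dynamic" records, as they might come out of a Python pipeline.
records = [
    {"name": "run_a", "n_iter": 12, "residual": 1.5e-3},
    {"name": "run_b", "n_iter": 40, "residual": 2.0e-6},
]

# Fixed layout: 16-byte string, little-endian int32, little-endian float64.
# These field names and widths are illustrative assumptions.
result_dtype = np.dtype([("name", "S16"), ("n_iter", "<i4"), ("residual", "<f8")])

# Each dict becomes one tuple in field order; strings are encoded to bytes.
table = np.array(
    [(r["name"].encode("ascii"), r["n_iter"], r["residual"]) for r in records],
    dtype=result_dtype,
)

print(table.dtype.itemsize)  # bytes per record (packed, no padding by default)
print(table["n_iter"])       # column access by field name
```

Anything that does not fit a fixed layout (nested dicts, variable-length lists) would need its own array or its own convention, which is the trade-off described above.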

If you need cross-language access, I would suggest using HDF5 as it is a file format which is specifically designed to be independent of language, operating system, system architecture (e.g. on loading it can convert between big-endian and little-endian automatically) and is specifically aimed at users doing scientific/numerical computing. I don't know much about Google Protocol Buffers, so I can't really comment too much on that.

If you decide to go with HDF5, I would also recommend that you use h5py instead of pytables. This is because pytables creates HDF5 files with a whole lot of extra pythonic metadata which makes reading the data in C/C++ a bit more of a pain, whereas h5py doesn't create any of these extras. You can find a comparison here, and they also give a link to the pytables FAQ for their view on the matter so you can decide which suits your needs best.
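
For reference, a minimal sketch of what writing a NumPy array plus some metadata with h5py might look like. The file name, dataset name and attribute names are made up for the example, and it assumes h5py is installed alongside NumPy.

```python
import os
import tempfile

import h5py
import numpy as np

data = np.arange(12, dtype=np.float64).reshape(3, 4)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "results.h5")

# Write the array as a dataset and attach metadata as HDF5 attributes.
with h5py.File(path, "w") as f:
    dset = f.create_dataset("result", data=data)
    dset.attrs["units"] = "seconds"   # illustrative attribute
    dset.attrs["n_samples"] = 3       # illustrative attribute

# Read everything back; shape, dtype and attributes travel with the file.
with h5py.File(path, "r") as f:
    loaded = f["result"][...]
    units = f["result"].attrs["units"]
    n_samples = int(f["result"].attrs["n_samples"])

print(loaded.shape, loaded.dtype, units, n_samples)
```

A file written this way contains only a plain dataset and its attributes, so it should be readable from the HDF5 C/C++ API without any Python-specific conventions.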

Another format which is very similar to HDF5 is NetCDF. This also has Python bindings, however I have no experience in using this format so I cannot really comment beyond pointing out that it exists and is also widely used in scientific computing.

DaveP answered Nov 01 '22 06:11


I don't know about HDF5, but you can store Python objects in NumPy arrays. You just lose most of the important functionality, because C-level operations can no longer be performed on the array.

In [16]: import numpy as np
In [17]: x = np.zeros(10, dtype=object)  # the np.object alias was removed in NumPy 1.24
In [18]: x[3] = {'pants', 10}
In [19]: x
Out[19]: array([0, 0, 0, set([10, 'pants']), 0, 0, 0, 0, 0, 0], dtype=object)
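
One practical consequence for data transfer, sketched below under the assumption of NumPy 1.16.3 or later: an object array can only be saved by pickling its contents, so the result is a Python-only format and not something C/C++ code can read directly. The in-memory buffer stands in for a file.

```python
import io

import numpy as np

x = np.zeros(3, dtype=object)
x[1] = {"pants", 10}

# Saving an object array falls back to pickle inside the .npy container.
buf = io.BytesIO()
np.save(buf, x, allow_pickle=True)

# Loading refuses pickled data unless explicitly allowed
# (allow_pickle defaults to False since NumPy 1.16.3).
buf.seek(0)
try:
    np.load(buf)
    needs_pickle = False
except ValueError:
    needs_pickle = True

buf.seek(0)
restored = np.load(buf, allow_pickle=True)

print(needs_pickle)
print(restored[1] == {"pants", 10})
```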
Autoplectic answered Nov 01 '22 07:11