
Can I use hdf5 for large amounts of text data?

Suppose I am going to programmatically fetch a hundred thousand open-access books as text strings from the internet. My intention is to do some analysis on them (using pandas). I am already using MongoDB in some parts of my application, but I don't think it's easy to put it on a pen drive and transfer it to a different machine. SQLite is portable, but I hate writing SQL. The other options I have seen are to just put the books in the filesystem as individual text files, or in something called HDF5.

Is hdf5 good for this type of text-only data? If not, what other options are available?

yayu asked Nov 18 '14 13:11


2 Answers

Yes, you can, but if I were you, I would just use individual text files and zip the containing directory. Here is why:

Large arrays of numbers (HDF5's bread and butter) can be stored efficiently in binary format, but text has no more compact binary representation: it is already stored as encoded bytes, so HDF5 offers no space advantage. Yes, you can enable compression within HDF5 files, but you can just as easily compress text files.
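To illustrate how little effort compressing a text file takes, here is a minimal sketch using only Python's standard `gzip` module. The sample text and file names are made up for the example, and because the sample is one sentence repeated many times, it compresses far better than a real book would.

```python
import gzip
import os
import tempfile

# Hypothetical stand-in for one downloaded book (not real data).
book_text = "It was the best of times, it was the worst of times. " * 2000

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "book.txt")
gz_path = os.path.join(tmp_dir, "book.txt.gz")

# Plain text file, for size comparison.
with open(plain_path, "w", encoding="utf-8", newline="") as f:
    f.write(book_text)

# gzip.open in text mode ("wt") compresses transparently on write.
with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as f:
    f.write(book_text)

# Reading back is symmetric ("rt").
with gzip.open(gz_path, "rt", encoding="utf-8", newline="") as f:
    restored = f.read()

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print(f"plain: {plain_size} bytes, gzip: {gz_size} bytes")
```

The only change from ordinary file handling is swapping `open` for `gzip.open`.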

Both text files and zip files are pretty much universal these days, so there is nothing to gain in terms of portability.

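Here is one example of something trivial that HDF5 makes hard: removing a dataset and reclaiming its space. Deleting a dataset does not shrink the file; you have to rewrite the file (for example with the h5repack tool) to get the space back.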

Lastly, that's one more dependency for your project, whereas text files come for free in any programming language.
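The text-files-plus-zip approach can be sketched with the standard library alone. This is a rough illustration, not a definitive layout: the `books` dictionary, the ids, the file naming scheme and the `load_books` helper are all hypothetical.

```python
import os
import shutil
import tempfile
import zipfile

# Hypothetical sample data: book id -> full text.
books = {
    "book_0001": "Call me Ishmael. Some years ago...",
    "book_0002": "It is a truth universally acknowledged...",
}

work_dir = tempfile.mkdtemp()
books_dir = os.path.join(work_dir, "books")
os.makedirs(books_dir)

# One UTF-8 text file per book.
for book_id, text in books.items():
    path = os.path.join(books_dir, book_id + ".txt")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

# Zip the containing directory; this creates <work_dir>/books.zip.
archive_path = shutil.make_archive(
    os.path.join(work_dir, "books"), "zip", root_dir=books_dir
)


def load_books(zip_path):
    """Read every .txt entry of the archive into a dict, without extracting."""
    loaded = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".txt"):
                book_id = os.path.splitext(os.path.basename(name))[0]
                loaded[book_id] = zf.read(name).decode("utf-8")
    return loaded


loaded = load_books(archive_path)
print(sorted(loaded))
```

The single `books.zip` file is what you would copy to the other machine. Since the question mentions pandas, the resulting dictionary can be handed to `pandas.Series(loaded)` for analysis.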

Simon answered Oct 07 '22 12:10


It looks like it, yes.

From the HDF group website, and their description of HDF5: "HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data."

More information here: http://www.hdfgroup.org/HDF5/

Good luck!

RDXdev answered Oct 07 '22 12:10