Suppose I am going to programmatically fetch a hundred thousand open-access books as text strings from the internet. My intention is to do some analysis on them (using pandas). I am already using MongoDB in other parts of my application, but I don't think it's easy to put a MongoDB database on a pen drive and transfer it to a different machine. SQLite is portable, but I hate writing SQL. The other options I have seen are to store the books in the filesystem as individual text files, or in something called HDF5.
Is HDF5 good for this type of text-only data? If not, what other options are available?
Yes, you can, but if I were you I would just use individual text files and zip the containing directory. Here is why:
Large arrays of numbers (HDF5's bread and butter) can be stored efficiently in binary format, but there is no binary representation of text, so HDF5 gains you nothing in terms of space. Yes, you can enable compression within HDF5 files, but you can just as easily compress plain text files.
Both text files and zip archives are pretty much universal these days, so there is nothing to gain in terms of portability either.
Here is one example of something trivial you cannot do with HDF5: delete a dataset and reclaim its space. The file does not shrink when a dataset is removed; you have to repack it with an external tool such as h5repack.
Lastly, HDF5 is one more dependency for your project, whereas text files come for free in any programming language.
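To make the recommendation concrete, here is a minimal sketch of the text-files-plus-zip approach using only Python's standard library. All file names, directory names, and book contents below are made up for illustration; in practice the strings would come from your downloads.

```python
import zipfile
from pathlib import Path

# Hypothetical corpus; in your case these strings come from the internet.
books = {
    "book_0001": "Call me Ishmael...",
    "book_0002": "It was the best of times...",
}

# One UTF-8 text file per book.
corpus_dir = Path("corpus")
corpus_dir.mkdir(exist_ok=True)
for name, text in books.items():
    (corpus_dir / f"{name}.txt").write_text(text, encoding="utf-8")

# Zip the whole directory; DEFLATE compresses plain text well.
with zipfile.ZipFile("corpus.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in sorted(corpus_dir.glob("*.txt")):
        zf.write(path, arcname=path.name)

# Later (possibly on another machine), read a book straight out of the archive.
with zipfile.ZipFile("corpus.zip") as zf:
    first_text = zf.read("book_0001.txt").decode("utf-8")
```

From there, iterating over `ZipFile.namelist()` makes it straightforward to load the texts into a pandas DataFrame for analysis, without ever unpacking the archive.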
It looks like it, yes.
From the HDF Group's website, in their description of HDF5: "HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data."
More information here: http://www.hdfgroup.org/HDF5/
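As a sketch of what that looks like in practice, HDF5 does support variable-length UTF-8 strings, which you can write via the h5py library (a third-party package). The file name and dataset name here are made up for illustration:

```python
import h5py  # third-party: pip install h5py

# Variable-length UTF-8 string datatype.
dt = h5py.string_dtype(encoding="utf-8")

# Write a small hypothetical corpus to a dataset.
with h5py.File("books.h5", "w") as f:
    f.create_dataset(
        "books",
        data=["Call me Ishmael...", "It was the best of times..."],
        dtype=dt,
    )

# Read it back; depending on the h5py version, strings may come back
# as bytes, so decode defensively.
with h5py.File("books.h5", "r") as f:
    texts = [
        s.decode("utf-8") if isinstance(s, bytes) else s
        for s in f["books"][:]
    ]
```

So text storage works; whether it buys you anything over plain files for this workload is the question the other answer addresses.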
Good luck!