I'm storing pandas DataFrames dumped in HDF format on S3. I'm pretty much stuck, as I can't pass a file pointer, a URL, an S3 URL, or a StringIO object to read_hdf. If I understand it correctly, the file must be present on the filesystem.
Source: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L315
It looks like this is implemented for CSV but not for HDF. Is there any better way to open those HDF files than copying them to the filesystem?
For the record, these HDF files are being handled on a web server, which is why I don't want a local copy.
If I need to stick with a local file: is there any way to emulate that file on the filesystem (with a real path) so it can be destroyed after the read is done (roughly like the sketch below)?
I'm using Python 2.7 with Django 1.9 and pandas 0.18.1.
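The temporary-file route I have in mind would be something like this (just a sketch: boto3 is assumed to be available, and the bucket, object key, and HDF key names are placeholders):

```python
import tempfile

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Download the S3 object into a named temporary file so read_hdf gets a real
# path on disk; the file is deleted automatically when the block exits.
# "my-bucket", "frames/data.h5" and "df" are placeholder names.
with tempfile.NamedTemporaryFile(suffix=".h5") as tmp:
    s3.download_fileobj("my-bucket", "frames/data.h5", tmp)
    tmp.flush()
    df = pd.read_hdf(tmp.name, key="df")
```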
df = pandas.read_hdf('fileName.hdf5') -> this loads the data into a pandas DataFrame that you can use.
Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.
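As a small illustration of that (the file name and keys below are made up, and PyTables must be installed), several DataFrames can live in one HDF5 file and be read back individually:

```python
import pandas as pd

# Two related DataFrames stored in one HDF5 file under separate keys.
orders = pd.DataFrame({"order_id": [1, 2], "total": [9.99, 24.50]})
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ada", "Linus"]})

with pd.HDFStore("shop.h5", mode="w") as store:
    store.put("orders", orders)
    store.put("customers", customers)

# Each object can later be read back on its own by key.
orders_again = pd.read_hdf("shop.h5", key="orders")
```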
Pandas supports writing your files directly to S3 using df.to_csv. It also supports Feather and Parquet files. The only drawback is that it will overwrite any existing object with the same name at the given S3 location.
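A minimal sketch of that, assuming a recent pandas with s3fs installed and credentials configured (the bucket and key names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Writing straight to S3; any existing object at this key is overwritten.
df.to_csv("s3://my-bucket/exports/data.csv", index=False)

# Parquet works the same way (needs pyarrow or fastparquet as well).
df.to_parquet("s3://my-bucket/exports/data.parquet")
```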
Newer versions of pandas allow reading an HDF5 file directly from S3, as mentioned in the read_hdf documentation, so perhaps you should upgrade pandas if you can. This of course assumes you've set the right access rights to read those files: either with a credentials file or with public ACLs.
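As a quick sanity check of those access rights, something like this (s3fs assumed; the bucket name is a placeholder) will fail early if the credentials are not being picked up:

```python
import s3fs

# Picks up credentials from ~/.aws/credentials, environment variables,
# or an instance role; use anon=True for a public bucket instead.
fs = s3fs.S3FileSystem(anon=False)

# Listing the bucket confirms the process can actually see the objects.
print(fs.ls("my-bucket"))
```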
Regarding your last comment, I am not sure why storing several HDF5 files (one per DataFrame) would necessarily argue against using HDF5. Pickle should be much slower than HDF5, though joblib.dump might partially improve on this.
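For reference, a joblib round-trip of a DataFrame is just a dump/load pair (the file name here is arbitrary):

```python
import joblib
import pandas as pd

df = pd.DataFrame({"a": range(1000)})

# joblib uses an optimised pickle that handles the numpy arrays backing a
# DataFrame efficiently; compress=3 trades some speed for smaller files.
joblib.dump(df, "df.joblib", compress=3)

df_restored = joblib.load("df.joblib")
```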