Access HDF files stored on s3 in pandas

I'm storing pandas data frames dumped in HDF format on S3. I'm pretty much stuck as I can't pass the file pointer, the URL, the s3 URL or a StringIO object to read_hdf. If I understand it correctly the file must be present on the filesystem.

Source: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L315

It looks like this is implemented for CSV but not for HDF. Is there a better way to open those HDF files than copying them to the filesystem?

For the record, these HDF files are being handled on a web server, that's why I don't want a local copy.

If I need to stick with the local file: Is there any way to emulate that file on the filesystem (with a real path) which can be destroyed after the reading is done?

I'm using Python 2.7 with Django 1.9 and pandas 0.18.1.

fodma1 asked Sep 07 '16 14:09



People also ask

How do I read a HDF file in Python?

df = pandas.read_hdf("fileName.hdf5") reads the stored object into a pandas DataFrame that you can use. If the file holds more than one object, pass its key as well.

What is HDF in pandas?

Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.

Can pandas write directly to S3?

pandas supports writing files directly to S3 with DataFrame.to_csv when given an s3:// path (this requires the s3fs package). It also supports Feather and Parquet files. The one drawback is that it will overwrite any existing file with the same name at the given S3 location.


1 Answer

Newer versions of pandas can read an HDF5 file directly from S3, as mentioned in the read_hdf documentation. Perhaps you should upgrade pandas if you can. This of course assumes you've set up the access rights needed to read those files: either with a credentials file or with public ACLs.

Regarding your last comment, I am not sure why storing several HDF5 files per df would necessarily argue against using HDF5. Pickle should be much slower than HDF5, though joblib.dump might partially improve on this.
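
If upgrading is not an option, the temporary-file route raised in the question is one workaround. Below is a minimal sketch, assuming boto3 is installed and credentials are configured. The names `read_via_tempfile` and `read_hdf_from_s3`, and the bucket and key arguments, are hypothetical and only for illustration; they are not part of pandas or boto3.

```python
import os
import tempfile


def read_via_tempfile(download_to, reader, suffix=".h5"):
    """Fetch data to a real temporary path, read it, then delete the file.

    download_to: callable that writes the remote object to the given path.
    reader: callable that loads and returns the data from the given path.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the callables reopen it by name
    try:
        download_to(path)
        return reader(path)
    finally:
        os.remove(path)  # runs whether or not the read succeeded


def read_hdf_from_s3(bucket, key, hdf_key=None):
    """Hypothetical helper: copy one S3 object to a temp file and read it."""
    # Imported lazily so the generic helper above has no third-party deps.
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    return read_via_tempfile(
        lambda path: s3.download_file(bucket, key, path),
        lambda path: pd.read_hdf(path, key=hdf_key),
    )
```

Something like `df = read_hdf_from_s3("my-bucket", "frames/df.h5")` would then return the frame. The temporary file exists only for the duration of the call, and `mkstemp` is used rather than `NamedTemporaryFile` so that the path can be reopened by name on any platform.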

Louis MAYAUD answered Oct 13 '22 13:10
