
How can I efficiently save a python pandas dataframe in hdf5 and open it as a dataframe in R?

I think the title covers the issue, but to elucidate:

The pandas Python package has a DataFrame type for holding tabular data. It also has a convenient interface to the HDF5 file format, so pandas DataFrames (and other data) can be saved through a simple dict-like interface (assuming you have PyTables installed):

import pandas 
import numpy
d = pandas.HDFStore('data.h5')
d['testdata'] = pandas.DataFrame({'N': numpy.random.randn(5)})
d.close()

So far so good. However, if I then try to load that same HDF5 file into R, I see things aren't so simple:

> library(hdf5)
> hdf5load('data.h5')
NULL
> testdata
$block0_values
         [,1]      [,2]      [,3]       [,4]      [,5]
[1,] 1.498147 0.8843877 -1.081656 0.08717049 -1.302641
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"

$block0_items
[1] "N"
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "string"
attr(,"name")
[1] "N."

$axis1
[1] 0 1 2 3 4
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "integer"
attr(,"name")
[1] "N."

$axis0
[1] "N"
attr(,"CLASS")
[1] "ARRAY"
attr(,"VERSION")
[1] "2.3"
attr(,"TITLE")
[1] ""
attr(,"FLAVOR")
[1] "numpy"
attr(,"kind")
[1] "string"
attr(,"name")
[1] "N."

attr(,"TITLE")
[1] ""
attr(,"CLASS")
[1] "GROUP"
attr(,"VERSION")
[1] "1.0"
attr(,"ndim")
[1] 2
attr(,"axis0_variety")
[1] "regular"
attr(,"axis1_variety")
[1] "regular"
attr(,"nblocks")
[1] 1
attr(,"block0_items_variety")
[1] "regular"
attr(,"pandas_type")
[1] "frame"

Which brings me to my question: ideally I would be able to save data back and forth between R and pandas. I can obviously write a wrapper that loads the pandas output into R (I think, though a pandas MultiIndex might make that trickier), but I don't think I could then easily use that data back in pandas. Any suggestions?
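For anyone attempting such a wrapper, the R dump above already suggests the layout: axis0 holds the column labels, axis1 the row labels, and each blockN_values array holds the data for the columns named in blockN_items. Here is a minimal sketch of the reassembly logic, written in Python only to stay consistent with the snippet above (an R wrapper would follow the same steps). The function name and the plain-dict input are hypothetical, and it assumes, as in the dump, that each row of block0_values corresponds to one entry of block0_items:

```python
def frame_from_fixed_layout(group):
    """Rebuild (index, columns) from the pieces seen in the R dump.

    `group` is a hypothetical plain dict standing in for the HDF5 group;
    this is not a pandas or R API.
    """
    index = list(group["axis1"])      # row labels
    columns = list(group["axis0"])    # column labels, in frame order
    nblocks = group.get("nblocks", 1)

    by_name = {}
    for i in range(nblocks):
        items = group[f"block{i}_items"]
        values = group[f"block{i}_values"]
        # Assumption: one row of values per item, as displayed in the dump.
        for name, row in zip(items, values):
            if len(row) != len(index):
                raise ValueError(f"column {name!r} does not match the index length")
            by_name[name] = list(row)

    # Restore the original column order recorded in axis0.
    return index, {name: by_name[name] for name in columns}


# The values below are copied from the dump in the question.
dump = {
    "nblocks": 1,
    "axis0": ["N"],
    "axis1": [0, 1, 2, 3, 4],
    "block0_items": ["N"],
    "block0_values": [[1.498147, 0.8843877, -1.081656, 0.08717049, -1.302641]],
}
index, table = frame_from_fixed_layout(dump)
```

A frame with mixed dtypes would presumably have more than one block (nblocks > 1), which is why the sketch loops over blocks instead of reading block0 alone. A MultiIndex is not handled here.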

Bonus: what I really want to do is use the data.table package in R with a pandas DataFrame (the keying approach is suspiciously similar in both packages). Any help on that one would be greatly appreciated.

Griffith Rees asked Sep 05 '12 09:09



1 Answer

If you are still looking at this, take a look at this post on Google Groups. It shows how to exchange data between pandas and R via HDF5.

https://groups.google.com/forum/?fromgroups#!topic/pydata/0LR72GN9p6w

Jeff answered Sep 20 '22 12:09
