I am using R off and on as a "backend" to Python and thus need to occasionally import dataframes from R into Python, but I can't figure out how to import an R data.frame as a Pandas DataFrame.
For example, if I create a dataframe in R
rdf = data.frame(a=c(2, 3, 5), b=c("aa", "bb", "cc"), c=c(TRUE, FALSE, TRUE))
and then pull it into Python using rmagic with
%Rpull -d rdf
I get
array([(2.0, 1, 1), (3.0, 2, 0), (5.0, 3, 1)],
dtype=[('a', '<f8'), ('b', '<i4'), ('c', '<i4')])
I don't know what this is, and it's certainly not the
pd.DataFrame({'a': [2, 3, 5], 'b': ['aa', 'bb', 'cc'], 'c': [True, False, True]})
that I would expect.
The only thing that comes close to working for me is to use a file to transfer the dataframe, by writing in R
write.csv(data.frame(a=c(2, 3, 5), b=c("aa", "bb", "cc"), c=c(TRUE, FALSE, TRUE)), file="TEST.csv")
and then reading in Python
pd.read_csv("TEST.csv")
though even this approach produces an additional column: "Unnamed: 0".
What is the idiom for importing an R dataframe into Python as a Pandas dataframe?
First: array([(2.0, 1, 1), (3.0, 2, 0), (5.0, 3, 1)], dtype=[('a', '<f8'), ('b', '<i4'), ('c', '<i4')]) is a numpy structured array (http://docs.scipy.org/doc/numpy/user/basics.rec.html/). You can easily convert it to a pandas DataFrame by using pd.DataFrame:
In [65]:
import numpy as np
import pandas as pd
print(pd.DataFrame(np.array([(2.0, 1, 1), (3.0, 2, 0), (5.0, 3, 1)], dtype=[('a', '<f8'), ('b', '<i4'), ('c', '<i4')])))
a b c
0 2 1 1
1 3 2 0
2 5 3 1
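The b column is coded as integers (as if factor()'ed in R), and the c column was converted from boolean to int. The a column came through as float ('<f8'), which I found unexpected at first, although R stores c(2, 3, 5) as doubles unless the values are written as integer literals such as 2L.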
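To get from the coded result back to something closer to what the question expected, the b codes can be mapped back to their labels and the c column cast to bool. This is only a minimal sketch: the list b_levels is an assumption (the factor levels of rdf$b, typed in by hand in R's order), since the structured array itself does not carry them.

```python
import numpy as np
import pandas as pd

# The structured array that %Rpull -d rdf returned in the question.
arr = np.array(
    [(2.0, 1, 1), (3.0, 2, 0), (5.0, 3, 1)],
    dtype=[("a", "<f8"), ("b", "<i4"), ("c", "<i4")],
)

df = pd.DataFrame(arr)

# Assumption: the factor levels of rdf$b, in R's order, supplied by hand.
b_levels = ["aa", "bb", "cc"]

# R factor codes start at 1, pandas category codes start at 0.
df["b"] = pd.Categorical.from_codes(df["b"].to_numpy() - 1, categories=b_levels)

# The logical column arrived as 0/1 integers; cast it back to bool.
df["c"] = df["c"].astype(bool)

print(df)
print(df.dtypes)
```

If plain strings are preferred over a categorical column, df["b"].astype(str) afterwards should do it.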
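Second, I think pandas.rpy.common is the most convenient way of fetching data from R: http://pandas.pydata.org/pandas-docs/stable/r_interface.html (the documentation there is probably too brief, so I will add another example here). Note that pandas.rpy was removed in pandas 0.20 in favor of the rpy2 project, so this applies to older pandas versions: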
In [71]:
import pandas as pd
import pandas.rpy.common as com
# Round trip: pandas DataFrame -> R data.frame -> pandas DataFrame
DF = pd.DataFrame({'val': [1, 1, 1, 2, 2, 3, 3]})
r_DF = com.convert_to_r_dataframe(DF)
print(pd.DataFrame(com.convert_robj(r_DF)))
val
0 1
1 1
2 1
3 2
4 2
5 3
6 3
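Finally, the Unnamed: 0 column is the index column (the R row names, which write.csv writes by default under an empty header). You can avoid it by passing index_col=0 to pd.read_csv(), which uses that column as the DataFrame index instead.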
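As a small illustration of the index_col=0 point, the sketch below reads the same CSV text twice. The csv_text string is an assumption about what write.csv produces for the example data with default settings; it stands in for TEST.csv so the snippet does not need R or a file on disk.

```python
import io

import pandas as pd

# Assumed contents of TEST.csv as written by write.csv with default settings:
# the first column holds R's row names under an empty header.
csv_text = (
    '"","a","b","c"\n'
    '"1",2,"aa",TRUE\n'
    '"2",3,"bb",FALSE\n'
    '"3",5,"cc",TRUE\n'
)

# Without index_col, the row names show up as an extra "Unnamed: 0" column.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))

# With index_col=0, the row names become the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(without_extra)
```

On the R side, passing row.names=FALSE to write.csv leaves the row names out of the file altogether, in which case index_col is not needed.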