
*Efficiently* moving dataframes from Pandas to R with RPy (or other means)

I have a dataframe in Pandas, and I want to do some statistics on it using R functions. No problem! RPy makes it easy to send a dataframe from Pandas into R:

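import pandas as pd
from rpy2 import robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # register the pandas <-> R converters (rpy2 2.x-style API)

df = pd.DataFrame(index=range(100000), columns=range(100))
ro.globalenv['df'] = df  # converts the dataframe and binds it to `df` in R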

And if we're in IPython:

%load_ext rmagic
%R -i df

For some reason the ro.globalenv route is slightly slower than the rmagic route, but no matter. What matters is this: The dataframe I will ultimately be using is ~100GB. This presents a few problems:

  1. Even with just 1GB of data, the transfer is rather slow.
  2. If I understand correctly, this creates two copies of the dataframe in memory: one in Python, and one in R. That means I'll have just doubled my memory requirements, and I haven't even gotten to running statistical tests!

Is there any way to:

  1. Transfer a large dataframe between Python and R more quickly?
  2. Access the same object in memory? I suspect this is asking for the moon.
jeffalstott asked May 03 '15 08:05



2 Answers

rpy2 uses a conversion mechanism that tries to avoid copying objects when moving between Python and R. However, this currently works only in the direction R -> Python.

Python has an interface called the "buffer interface" that rpy2 uses to minimize the number of copies for data structures that are compatible at the C level between R and Python (see http://rpy.sourceforge.net/rpy2/doc-2.5/html/numpy.html#from-rpy2-to-numpy; the doc seems outdated, as the __array_struct__ interface is no longer the primary choice).
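To make the idea of the buffer interface concrete, here is a minimal sketch using only the Python standard library. It does not involve rpy2 or R at all; it only shows how one object can expose its memory to another without a copy, which is the same mechanism described above. The variable names are made up for illustration.

```python
from array import array

# A contiguous, C-level block of doubles owned by a Python object.
values = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview accesses that block through the buffer interface; no copy is made.
view = memoryview(values)

# Writing through the view changes the original array, which shows that
# both names refer to the same underlying memory.
view[0] = 42.0

# Slicing a memoryview yields another view, not a copy.
tail = view[2:]
tail[0] = -1.0

print(values.tolist())  # [42.0, 2.0, -1.0, 4.0]
```

Anything that consumes the buffer interface can read the data in place the same way, which is why no copy is needed on the Python side.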

There is no equivalent to the buffer interface in R, and the current concern holding me back from providing equivalent functionality in rpy2 is the handling of borrowed references during garbage collection (and the lack of time to think about it sufficiently carefully).

So in summary, there is a way to share data between Python and R without copying, but it requires the data to be created in R.

lgautier answered Oct 19 '22 22:10



Currently, feather seems to be the most efficient option for exchanging data frames between R and pandas.

TurtleIzzy answered Oct 19 '22 22:10
