I have a dataframe in Pandas, and I want to do some statistics on it using R functions. No problem! RPy makes it easy to send a dataframe from Pandas into R:
import pandas as pd
df = pd.DataFrame(index=range(100000), columns=range(100))
from rpy2 import robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()  # register the pandas <-> R conversion so the assignment below works
ro.globalenv['df'] = df
And if we're in IPython:
%load_ext rpy2.ipython  # formerly %load_ext rmagic
%R -i df
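With the dataframe transferred, R functions can then be run on it straight from a cell. A trivial sketch (any R code would do here):
%%R
dim(df)
summary(df[, 1])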
For some reason the ro.globalenv route is slightly slower than the rmagic route, but no matter. What matters is this: the dataframe I will ultimately be using is ~100GB. This presents a few problems: chief among them, sending the dataframe into R appears to copy it, and with ~100GB of data holding two copies in memory is not an option. Is there any way to share the data between Python and R without duplicating it in memory?
rpy2 uses a conversion mechanism that tries to avoid copying objects when they move between Python and R. However, this currently only works in the direction R -> Python.
Python has an interface called the "buffer interface" that rpy2 uses to minimize the number of copies for data types that are C-level compatible between R and Python (see http://rpy.sourceforge.net/rpy2/doc-2.5/html/numpy.html#from-rpy2-to-numpy - the doc seems outdated, as the __array_struct__ interface is no longer the primary choice).
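To illustrate that R -> Python direction, a minimal sketch: data allocated in R can be wrapped as a numpy array through that interface. The variable names here are illustrative, and whether the wrap is truly zero-copy depends on the rpy2 version (the linked doc describes it as a view for atomic vectors):
import numpy as np
from rpy2 import robjects as ro

# Allocate the data on the R side, then fetch a handle to it in Python
ro.r('x <- rnorm(1e6)')
x_r = ro.globalenv['x']

# numpy wraps the R vector through the buffer/array interface;
# per the rpy2 docs this is a view on the R-allocated memory, not a copy
x_np = np.asarray(x_r)

x_np[0] = 42.0
print(ro.r('x[1]')[0])  # 42.0 if the array really shares memory with R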
There is no equivalent to the buffer interface in R, and the concern currently holding me back from providing equivalent functionality in rpy2 is the handling of borrowed references during garbage collection (and the lack of time to think about it sufficiently carefully).
So, in summary, there is a way to share data between Python and R without copying, but it requires the data to be created in R.
Currently, feather seems to be the most efficient option for data interchange between R data frames and pandas DataFrames.
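A minimal sketch of that route (assuming pyarrow is installed on the Python side and the feather or arrow package on the R side; the file name is illustrative):
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': [0.1, 0.2, 0.3, 0.4, 0.5]})
# Feather is a columnar on-disk format readable by both pandas and R
df.to_feather('df.feather')  # requires pyarrow

and on the R side (or in a %%R cell):
library(feather)  # or: library(arrow)
df <- read_feather('df.feather')
This goes through a file rather than shared memory, but in practice it is often much faster than converting the dataframe through rpy2.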