Is there some recommended way to pass data from R (in the form of a data.table) to Python without having to save the data to disk? I know I could use Python modules from R via reticulate (and I suppose the same can be done in the other direction with rpy2), but from what I've read that cross-language bridging hurts the overall performance of the libraries. There is therefore a good chance that writing my R data.table to disk and reading the same data from disk in Python to run, say, lightgbm would be faster than trying to run lightgbm through reticulate or data.table through rpy2.
Why don't I just stick to either R or Python: I prefer R's data.table (as opposed to Pandas) for my data manipulations because it is much faster, more memory efficient, and has a lot of features I like, such as non-equi joins, rolling joins, Cartesian joins, and very straightforward melting and casting. I also like that whenever I ask a data.table-related question on Stack Overflow I get a high-quality answer quickly, while for Pandas I haven't been as successful. However, there are tasks for which I prefer Python, such as gradient boosting or neural networks.
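For context, a minimal sketch of what the reticulate route looks like (a hypothetical example; it assumes the Python lightgbm package is installed in the environment reticulate uses, and the toy data is made up):

library(reticulate)
library(data.table)

lgb <- import("lightgbm")  # load the Python lightgbm module into R

# hypothetical toy data; when passed to a Python function, reticulate
# converts the data.table to a pandas DataFrame on the fly -- this
# conversion is exactly the overhead the question worries about
d <- data.table(x1 = rnorm(1000), x2 = rnorm(1000), y = rnorm(1000))
model <- lgb$LGBMRegressor(n_estimators = 50L)
model$fit(d[, .(x1, x2)], d$y)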
There is no recommended way.
In theory you have to dump the R data.frame to disk and read it back in Python.
In practice (assuming a production-grade operating system; /dev/shm is available on Linux), you can use the "RAM disk" location /dev/shm/, so you essentially write the data to a file that resides in RAM and then read it from RAM directly, without ever touching disk storage.
Example usage:
library(data.table)
fwrite(iris, "/dev/shm/data.csv")  # write to the RAM-backed location
d = fread("/dev/shm/data.csv")     # read it back (from Python in practice)
unlink("/dev/shm/data.csv")        # clean up: files in /dev/shm occupy RAM
As for the format, you have the following options:
csv - universal and portable format
data.table's fwrite function is super fast and produces a portable csv data file. Be sure to enable all CPU threads with setDTthreads(0L) before using fwrite on a multi-core machine.
Then in Python you need to read the csv file, for which the python datatable module is very fast; if needed, the object can then be converted to pandas using x.to_pandas().
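A minimal sketch of the Python side (assuming the file written to /dev/shm as above):

import datatable as dt

x = dt.fread("/dev/shm/data.csv")  # multi-threaded csv reader
df = x.to_pandas()                 # optional: convert to pandas if needed downstream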
feather - "portable" binary format
Another option is to use R's arrow package and its write_feather function, and then read the data in Python using the pyarrow module and read_feather.
This format should be faster than csv in most cases; see the timings below. For writing the difference might not be that big, but reading will be much faster in most cases, especially when reading many character variables in R (although that is not your use case, because you read in Python). On the other hand, it is not really portable yet (see apache/arrow#8732). Moreover, if a new version 3 is eventually released, files saved with the current feather format might not be compatible anymore.
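A minimal sketch of the feather round trip (same /dev/shm location as above):

# R side
arrow::write_feather(x, "/dev/shm/data.feather")

# Python side
from pyarrow import feather
df = feather.read_feather("/dev/shm/data.feather")  # returns a pandas DataFrame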
fst - fast binary format
fst can be used as a faster alternative to the feather format, but it is not yet possible to read fst data in Python, so this method cannot solve your problem as of now. You can track the progress of this feature request at https://github.com/fstpackage/fst/issues/184; once that issue is resolved, it will probably address your question in the fastest manner.
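For completeness, a minimal R-only sketch of the fst route (not usable from Python until the issue above is resolved):

library(fst)
write_fst(x, "/dev/shm/data.fst", compress = 50)      # compress ranges 0-100
d = read_fst("/dev/shm/data.fst", as.data.table = TRUE)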
Using the following scripts:
library(data.table)
setDTthreads(0L) ## use all threads; 40 on the benchmark machine

## case 1: 1e8 rows x 10 integer columns
N = 1e8L
x = setDT(lapply(1:10, function(...) sample.int(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))

## case 2: 1e8 rows x 10 double columns
N = 1e8L
x = setDT(lapply(1:10, function(...) runif(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))

## case 3: 1e7 rows x 10 character columns
N = 1e7L
x = setDT(lapply(1:10, function(...) paste0("id",sample.int(N))))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))
import datatable as dt
import timeit
import gc
from pyarrow import feather

# time csv read with python datatable
gc.collect()
t_start = timeit.default_timer()
x = dt.fread("/dev/shm/data.csv")
print(timeit.default_timer() - t_start, flush=True)

# time conversion to pandas
gc.collect()
t_start = timeit.default_timer()
y = x.to_pandas()
print(timeit.default_timer() - t_start, flush=True)
del x, y

# time feather read with pyarrow
gc.collect()
t_start = timeit.default_timer()
x = feather.read_feather("/dev/shm/data.feather", memory_map=False)
print(timeit.default_timer() - t_start, flush=True)
del x
I got the following timings:
[timings figure: csv vs feather write and read times for each data case]
The presented data cases (1e8 rows for int/double, 1e7 rows for character; 10 columns: int/double/character) are very basic, so be sure to check timings on your actual data.