Is there some recommended way to pass data from R (in the form of a data.table) to Python without having to save the data to disk? I know I could use Python modules from R via reticulate (and I suppose the same can be done in the other direction with rpy2), but from what I've read that cross-language bridging hurts the overall performance of the libraries. There is therefore a good chance that writing my R data.table to disk and reading the same data from disk in Python to run, say, lightgbm would be faster than trying to run lightgbm through reticulate or data.table through rpy2.
Why don't I just stick to either R or Python: I prefer R's data.table (as opposed to Pandas) for my data manipulations because it is much faster, more memory efficient, and has a lot of features I like, such as non-equi joins, rolling joins, Cartesian joins, and very straightforward melting and casting. I also like that whenever I ask a data.table-related question on Stack Overflow I get a high-quality answer quickly, while for Pandas I haven't been as successful. However, there are tasks for which I prefer Python, such as gradient boosting or neural networks.
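For context, a minimal sketch of what the reticulate route looks like (a hypothetical example; it assumes the Python lightgbm package is installed in the environment reticulate uses, and the toy data is made up):

library(reticulate)
library(data.table)

lgb <- import("lightgbm")  # load the Python lightgbm module into R

# hypothetical toy data; when passed to a Python function, reticulate
# converts the data.table to a pandas DataFrame on the fly -- this
# conversion is exactly the overhead the question worries about
d <- data.table(x1 = rnorm(1000), x2 = rnorm(1000), y = rnorm(1000))
model <- lgb$LGBMRegressor(n_estimators = 50L)
model$fit(d[, .(x1, x2)], d$y)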
There is no recommended way.
In theory you have to dump the R data.frame to disk and read it back in Python.
In practice (assuming a production-grade operating system; /dev/shm is available on Linux), you can use the "RAM disk" location /dev/shm/, so you essentially write the data to a file that resides in RAM and then read it from RAM directly, without ever touching disk storage.
Example usage:
library(data.table)
fwrite(iris, "/dev/shm/data.csv")  # write to the RAM-backed location
d = fread("/dev/shm/data.csv")     # read it back (from Python in practice)
unlink("/dev/shm/data.csv")        # clean up: files in /dev/shm occupy RAM
As for the format, you have the following options:
csv - universal and portable format
data.table's fwrite function is super fast and produces a portable csv data file. Be sure to enable all CPU threads with setDTthreads(0L) before using fwrite on a multi-core machine.
Then in Python you need to read the csv file, for which the python datatable module is very fast; if needed, the object can then be converted to pandas using x.to_pandas().
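A minimal sketch of the Python side (assuming the file written to /dev/shm as above):

import datatable as dt

x = dt.fread("/dev/shm/data.csv")  # multi-threaded csv reader
df = x.to_pandas()                 # optional: convert to pandas if needed downstream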
feather - "portable" binary format
Another option is to use R's arrow package and its write_feather function, and then read the data in Python using the pyarrow module and read_feather.
This format should be faster than csv in most cases; see the timings below. For writing the difference might not be that big, but reading will be much faster in most cases, especially when reading many character variables in R (although that is not your use case, because you read in Python). On the other hand, it is not really portable yet (see apache/arrow#8732). Moreover, if a new version 3 is eventually released, files saved with the current feather format might not be compatible anymore.
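A minimal sketch of the feather round trip (same /dev/shm location as above):

# R side
arrow::write_feather(x, "/dev/shm/data.feather")

# Python side
from pyarrow import feather
df = feather.read_feather("/dev/shm/data.feather")  # returns a pandas DataFrame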
fst - fast binary format
fst can be used as a faster alternative to the feather format, but it is not yet possible to read fst data in Python, so this method cannot solve your problem as of now. You can track the progress of this feature request at https://github.com/fstpackage/fst/issues/184; once that issue is resolved, it will probably address your question in the fastest manner.
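For completeness, a minimal R-only sketch of the fst route (not usable from Python until the issue above is resolved):

library(fst)
write_fst(x, "/dev/shm/data.fst", compress = 50)      # compress ranges 0-100
d = read_fst("/dev/shm/data.fst", as.data.table = TRUE)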
Using the following scripts:
library(data.table)
setDTthreads(0L) ## use all threads; 40 on the benchmark machine

## case 1: 1e8 rows x 10 integer columns
N = 1e8L
x = setDT(lapply(1:10, function(...) sample.int(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))

## case 2: 1e8 rows x 10 double columns
N = 1e8L
x = setDT(lapply(1:10, function(...) runif(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))

## case 3: 1e7 rows x 10 character columns
N = 1e7L
x = setDT(lapply(1:10, function(...) paste0("id",sample.int(N))))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python
unlink(paste0("/dev/shm/data.",c("csv","feather")))
import datatable as dt
import timeit
import gc
from pyarrow import feather

# time csv read with python datatable
gc.collect()
t_start = timeit.default_timer()
x = dt.fread("/dev/shm/data.csv")
print(timeit.default_timer() - t_start, flush=True)

# time conversion to pandas
gc.collect()
t_start = timeit.default_timer()
y = x.to_pandas()
print(timeit.default_timer() - t_start, flush=True)
del x, y

# time feather read with pyarrow
gc.collect()
t_start = timeit.default_timer()
x = feather.read_feather("/dev/shm/data.feather", memory_map=False)
print(timeit.default_timer() - t_start, flush=True)
del x
I got the following timings:
[timings figure: csv vs feather write and read times for each data case]
The presented data cases (1e8 rows for int/double, 1e7 rows for character; 10 columns: int/double/character) are very basic, so be sure to check timings on your actual data.