 

What is the recommended way to access data from R data.table in python? Can I avoid writing data to disc?

Is there some recommended way to pass data from R (in the form of a data.table) to Python without having to save it to disk? I know I could use Python modules from R via reticulate (and I suppose the reverse can be done with rpy2), but from what I've read that hurts the overall performance of the libraries. It therefore seems quite likely that storing my R data.table to disk and reading that same data from disk in Python, then running, say, lightgbm, is better than trying to run lightgbm through reticulate or data.table through rpy2.

Why I don't just stick to either R or Python:

I prefer using R's data.table (as opposed to pandas) for my data manipulations, because it is much faster, more memory efficient, and has a lot of features I like, such as non-equi joins, rolling joins, cartesian joins, and pretty straightforward melting and casting. I also like that whenever I ask a data.table-related question on Stack Overflow, I get a high-quality answer pretty fast, while with pandas I haven't been so successful. However, there are tasks for which I prefer Python, such as gradient boosting or neural networks.

asked Aug 18 '20 by ira



1 Answer

There is no recommended way.

In theory, you have to dump the R data.table to disk and read it back in Python.

In practice (assuming a production-grade operating system), you can use a "RAM disk" location such as /dev/shm/: you write the data to a file that resides in RAM and then read it from RAM directly, without ever touching disk.

Example usage:

library(data.table)                  # fwrite / fread
fwrite(iris, "/dev/shm/data.csv")    # write to a RAM-backed file
d = fread("/dev/shm/data.csv")       # read it back from RAM
unlink("/dev/shm/data.csv")          # clean up
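For completeness, the Python side of the same round trip is just an ordinary file read. Here is a minimal standard-library sketch (the file name and columns are illustrative, and it falls back to an ordinary temp directory when /dev/shm is unavailable, e.g. outside Linux):

```python
import csv
import os
import tempfile

# Use the RAM-backed /dev/shm when available (Linux); otherwise fall back
# to a regular temp directory so the sketch still runs elsewhere.
base = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
path = os.path.join(base, "data.csv")

# Stand-in for what fwrite() produced on the R side.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Sepal.Length", "Species"])
    writer.writerow([5.1, "setosa"])

# An ordinary file read -- on /dev/shm the data never touches disk.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

os.remove(path)  # the Python equivalent of R's unlink()
print(rows[1])   # ['5.1', 'setosa']
```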

As for the format, you have the following options:

  1. csv - universal and portable format

data.table's fwrite function is very fast and produces a portable CSV data file. Be sure to enable all CPU threads with setDTthreads(0L) before calling fwrite on a multi-core machine.

Then in Python you need to read the CSV file, for which the Python datatable module's fread is very fast; if needed, the resulting frame can be converted to pandas with x.to_pandas().

  2. feather - "portable" binary format

Another option is to use R's arrow package and its write_feather function, then read the data in Python using the pyarrow module and read_feather.

This format should be faster than CSV in most cases; see the timings below. When writing, the difference might not be that big, but reading will be much faster in most cases, especially when many character columns are involved in R (although that is not your use case, because you read in Python). On the other hand, it is not really portable yet (see apache/arrow#8732). Moreover, if a new format version 3 is eventually released, files saved with the current feather format might not be compatible anymore.

  3. fst - fast binary format

fst can be used as a faster alternative to the feather format, but it is not yet possible to read fst data in Python, so this method cannot solve your problem as of now. You can track progress of this feature request at https://github.com/fstpackage/fst/issues/184; once it is resolved, it will probably address your question in the fastest manner.


Using the following scripts (R first, then Python):

library(data.table)
setDTthreads(0L) ## use all threads (40 on this machine)

N = 1e8L
x = setDT(lapply(1:10, function(...) sample.int(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python

unlink(paste0("/dev/shm/data.",c("csv","feather")))
N = 1e8L
x = setDT(lapply(1:10, function(...) runif(N)))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python

unlink(paste0("/dev/shm/data.",c("csv","feather")))
N = 1e7L
x = setDT(lapply(1:10, function(...) paste0("id",sample.int(N))))
system.time(arrow::write_feather(x, "/dev/shm/data.feather"))
system.time(fwrite(x, "/dev/shm/data.csv", showProgress=FALSE))
rm(x)
## run python

unlink(paste0("/dev/shm/data.",c("csv","feather")))

And the Python side:

import datatable as dt
import timeit
import gc
from pyarrow import feather

gc.collect()
t_start = timeit.default_timer()
x = dt.fread("/dev/shm/data.csv")
print(timeit.default_timer() - t_start, flush=True)
gc.collect()
t_start = timeit.default_timer()
y = x.to_pandas()
print(timeit.default_timer() - t_start, flush=True)
del x, y

gc.collect()
t_start = timeit.default_timer()
x = feather.read_feather("/dev/shm/data.feather", memory_map=False)
print(timeit.default_timer() - t_start, flush=True)
del x

I got the following timings:

  • integer:
    • write: feather 2.7s vs csv 5.7s
    • read: feather 2.8s vs csv 111s+3s
  • double:
    • write: feather 5.7s vs csv 10.8s
    • read: feather 5.1s vs csv 180s+4.9s
  • character:
    • write: feather 50.2s vs csv 2.8s
    • read: feather 35s vs csv 14s+16s

Based on the presented data cases (1e8 rows for int/double, 1e7 rows for character; 10 columns each of int/double/character), we can conclude the following:

  • writing int and double is around 2 times slower for csv than for feather
  • writing character is around 20 times faster for csv than for feather
  • reading int and double is much slower for csv than for feather
  • converting int and double from Python datatable to pandas is relatively cheap
  • reading character is around 2 times faster for csv than for feather
  • converting character from Python datatable to pandas is expensive

Note that these are very basic data cases; be sure to check timings on your actual data.
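A tiny helper makes that kind of spot-check easy, mirroring the timeit.default_timer() pattern used in the script above (`time_call` is a hypothetical name, not part of any library):

```python
import timeit


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds).

    A minimal harness for timing reads/writes on your own data,
    e.g. time_call(dt.fread, "/dev/shm/data.csv").
    """
    t0 = timeit.default_timer()
    result = fn(*args, **kwargs)
    return result, timeit.default_timer() - t0


# Cheap stand-in workload; swap in your own fread/read_feather call.
res, secs = time_call(sum, range(1_000_000))
print(res)  # 499999500000
```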

answered Sep 19 '22 by jangorecki