 

Speeding up the performance of write.table

Tags: r

I have a data.frame and I want to write it out. The dimensions of my data.frame are 256 rows by 65536 columns. What are faster alternatives to write.csv?

— asked May 08 '12 by lolibility


4 Answers

data.table::fwrite() was contributed by Otto Seiskari and is available in versions 1.9.8+. Matt has made additional enhancements on top (including parallelisation) and wrote an article about it. Please report any issues on the tracker.
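For reference, a minimal usage sketch (the object and output path here are made up for illustration):

library(data.table)

# illustrative data; any data.frame or data.table works
DT <- data.table(x = 1:1e6, y = runif(1e6))

# fwrite writes CSV using multiple threads; nThread defaults to getDTthreads()
fwrite(DT, "out.csv", quote = FALSE, nThread = getDTthreads())

# the thread count data.table uses can also be set globally
setDTthreads(2)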

First, here's a comparison on the same dimensions used by @chase (i.e., a very large number of columns: 65,536 columns (!) x 256 rows), together with fwrite and write_feather, so that we have some consistency across machines. Note the huge difference compress=FALSE makes in base R; a minimal reproduction sketch follows the table.

# -----------------------------------------------------------------------------
# function  | object type |  output type | compress= | Runtime | File size |
# -----------------------------------------------------------------------------
# save      |      matrix |    binary    |   FALSE   |    0.3s |    134MB  |
# save      |  data.frame |    binary    |   FALSE   |    0.4s |    135MB  |
# feather   |  data.frame |    binary    |   FALSE   |    0.4s |    139MB  |
# fwrite    |  data.table |    csv       |   FALSE   |    1.0s |    302MB  |
# save      |      matrix |    binary    |   TRUE    |   17.9s |     89MB  |
# save      |  data.frame |    binary    |   TRUE    |   18.1s |     89MB  |
# write.csv |      matrix |    csv       |   FALSE   |   21.7s |    302MB  |
# write.csv |  data.frame |    csv       |   FALSE   |  121.3s |    302MB  |
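Here is a minimal sketch of how timings along these lines can be reproduced (file names are illustrative; exact numbers depend on the machine):

# same shape as the question: 256 rows x 65,536 numeric columns
m  <- matrix(runif(256 * 65536), nrow = 256)
df <- as.data.frame(m)

system.time(save(m, file = "m_nocompress.RData", compress = FALSE))  # fast, large file
system.time(save(m, file = "m_compress.RData", compress = TRUE))     # much slower, smaller file
system.time(write.csv(df, "df.csv"))                                 # slowest of the three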

Note that fwrite() runs in parallel. The timing shown here is on a 13" MacBook Pro with 2 cores and 1 thread/core (+2 virtual threads via hyperthreading), 512GB SSD, 256KB/core L2 cache and 4MB L4 cache. Depending on your system spec, YMMV.

I also reran the benchmarks on relatively more likely (and bigger) data:

library(data.table)
NN <- 5e6 # at this number of rows, the .csv output is ~800Mb on my machine
set.seed(51423)
DT <- data.table(
  str1 = sample(sprintf("%010d",1:NN)), #ID field 1
  str2 = sample(sprintf("%09d",1:NN)),  #ID field 2
  # varying length string field--think names/addresses, etc.
  str3 = replicate(NN,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
  # factor-like string field with 50 "levels"
  str4 = sprintf("%05d",sample(sample(1e5,50),NN,T)),
  # factor-like string field with 17 levels, varying length
  str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T),
      collapse="")),NN,T),
  # lognormally distributed numeric
  num1 = round(exp(rnorm(NN,mean=6.5,sd=1.5)),2),
  # 3 binary strings
  str6 = sample(c("Y","N"),NN,T),
  str7 = sample(c("M","F"),NN,T),
  str8 = sample(c("B","W"),NN,T),
  # right-skewed (integer type)
  int1 = as.integer(ceiling(rexp(NN))),
  num2 = round(exp(rnorm(NN,mean=6,sd=1.5)),2),
  # lognormal numeric that can be positive or negative
  num3 = (-1)^sample(2,NN,T)*round(exp(rnorm(NN,mean=6,sd=1.5)),2))

# -------------------------------------------------------------------------------
# function  |   object   | out |        other args         | Runtime  | File size |
# -------------------------------------------------------------------------------
# fwrite    | data.table | csv |      quote = FALSE        |   1.7s   |  523.2MB  |
# fwrite    | data.frame | csv |      quote = FALSE        |   1.7s   |  523.2MB  |
# feather   | data.frame | bin |     no compression        |   3.3s   |  635.3MB  |
# save      | data.frame | bin |     compress = FALSE      |  12.0s   |  795.3MB  |
# write.csv | data.frame | csv |    row.names = FALSE      |  28.7s   |  493.7MB  |
# save      | data.frame | bin |     compress = TRUE       |  48.1s   |  190.3MB  |
# -------------------------------------------------------------------------------

So fwrite is ~2x faster than feather in this test. This was run on the same machine as noted above with fwrite running in parallel on 2 cores.

feather also seems to be quite a fast binary format, but it has no compression yet.


Here's an attempt at showing how fwrite compares with respect to scale:

NB: the benchmark has been updated to run base R's save() with compress = FALSE (since feather is also uncompressed).

[Figure: "Relative Speed of fwrite (turbo) vs. rest" — time relative to fwrite for write.csv, save and feather as the number of rows grows]

So fwrite is the fastest of them all on this data (running on 2 cores), plus it creates a .csv which can easily be viewed, inspected and passed to grep, sed, etc.

Code for reproduction:

require(data.table)
require(microbenchmark)
require(feather)
ns <- as.integer(10^seq(2, 6, length.out = 25))
DTn <- function(nn)
    data.table(
          str1 = sample(sprintf("%010d",1:nn)),
          str2 = sample(sprintf("%09d",1:nn)),
          str3 = replicate(nn,paste0(sample(LETTERS,sample(10:30,1),T), collapse="")),
          str4 = sprintf("%05d",sample(sample(1e5,50),nn,T)),
          str5 = sample(replicate(17,paste0(sample(LETTERS, sample(15:25,1),T), collapse="")),nn,T),
          num1 = round(exp(rnorm(nn,mean=6.5,sd=1.5)),2),
          str6 = sample(c("Y","N"),nn,T),
          str7 = sample(c("M","F"),nn,T),
          str8 = sample(c("B","W"),nn,T),
          int1 = as.integer(ceiling(rexp(nn))),
          num2 = round(exp(rnorm(nn,mean=6,sd=1.5)),2),
          num3 = (-1)^sample(2,nn,T)*round(exp(rnorm(nn,mean=6,sd=1.5)),2))

count <- data.table(n = ns,
                    c = c(rep(1000, 12),
                          rep(100, 6),
                          rep(10, 7)))

mbs <- lapply(ns, function(nn){
  print(nn)
  set.seed(51423)
  DT <- DTn(nn)
  microbenchmark(times = count[n==nn,c],
               write.csv=write.csv(DT, "writecsv.csv", quote=FALSE, row.names=FALSE),
               save=save(DT, file = "save.RData", compress=FALSE),
               fwrite=fwrite(DT, "fwrite_turbo.csv", quote=FALSE, sep=","),
               feather=write_feather(DT, "feather.feather"))})

png("microbenchmark.png", height=600, width=600)
par(las=2, oma = c(1, 0, 0, 0))
matplot(ns, t(sapply(mbs, function(x) {
  y <- summary(x)[,"median"]
  y/y[3]})),
  main = "Relative Speed of fwrite (turbo) vs. rest",
  xlab = "", ylab = "Time Relative to fwrite (turbo)",
  type = "l", lty = 1, lwd = 2, 
  col = c("red", "blue", "black", "magenta"), xaxt = "n", 
  ylim=c(0,25), xlim=c(0, max(ns)))
axis(1, at = ns, labels = prettyNum(ns, ","))
mtext("# Rows", side = 1, las = 1, line = 5)
legend("right", lty = 1, lwd = 3, 
       legend = c("write.csv", "save", "feather"),
       col = c("red", "blue", "magenta"))
dev.off()
— answered Nov 08 '22 by MichaelChirico


If all of your columns are of the same class, converting to a matrix before writing out provides a nearly 6x speed-up. Also, you can look into using write.matrix() from the MASS package, though it did not prove faster for this example. Maybe I didn't set something up properly:

#Fake data
m <- matrix(runif(256*65536), nrow = 256)
#AS a data.frame
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#----------
#   user  system elapsed 
# 319.53   13.65  333.76 

#As a matrix
system.time(write.csv(m, "matrix.csv"))
#----------
#   user  system elapsed 
#  52.43    0.88   53.59 

#Using write.matrix()
require(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#----------
#   user  system elapsed 
# 113.58   59.12  172.75 

EDIT

To address the concern raised in the comments that the results above are not fair to data.frame, here are some more results and timings. The overall message is still: convert your data object to a matrix if possible. If that's not possible, deal with it. Alternatively, reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance:

#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))
#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
#   user  system elapsed 
# 317.85   13.95  332.44
#This even includes the overhead in converting to as.matrix in the timing 
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
#   user  system elapsed 
#  53.67    0.92   54.67 

So, nothing really changes. To confirm this is reasonable, consider the relative time costs of as.data.frame():

m3 <- as.matrix(m2)
system.time(as.data.frame(m3))
#   user  system elapsed 
#   0.77    0.00    0.77 

So this is not really a big deal, nor does it skew the results as much as that comment suggests. If you're still not convinced that using write.csv() on large data.frames is a bad idea performance-wise, consult the manual under Note:

write.table can be slow for data frames with large numbers (hundreds or more) of
columns: this is inevitable as each column could be of a different class and so must be
handled separately. If they are all of the same class, consider using a matrix instead.

Finally, consider moving to a native RData object if you're still losing sleep over saving things faster:

system.time(save(m2, file = "thisisfast.RData"))
#   user  system elapsed 
#  21.67    0.12   21.81
— answered Nov 08 '22 by Chase


Another option is to use the feather file format.

df <- as.data.frame(matrix(runif(256*65536), nrow = 256))

system.time(feather::write_feather(df, "df.feather"))
#>   user  system elapsed 
#>  0.237   0.355   0.617 

Feather is a binary file format designed to be very efficient to read and write. It works across multiple languages: there are currently R and Python clients, and a Julia client is in the works.

For comparison, here's how long saveRDS takes:

system.time(saveRDS(df, "df.rds"))
#>   user  system elapsed 
#> 17.363   0.307  17.856

Now, this is a somewhat unfair comparison because the default for saveRDS is to compress the data, and here the data is incompressible because it's completely random. Turning compression off makes saveRDS significantly faster:

system.time(saveRDS(df, "df.rds", compress = FALSE))
#>   user  system elapsed 
#>  0.181   0.247   0.473     

And indeed it's now slightly faster than feather. So why use feather? Well, reading it back is typically faster than readRDS(), and you usually write the data relatively few times compared to the number of times that you read it.

system.time(readRDS("df.rds"))
#>   user  system elapsed 
#>  0.198   0.090   0.287 

system.time(feather::read_feather("df.feather"))
#>   user  system elapsed 
#>  0.125   0.060   0.185 
— answered Nov 08 '22 by hadley


The fst package

A more recent option for very fast reading and writing of data files is the fst package. fst generates files in a binary format.

Use write.fst(dat, "file.fst", compress=0), where compress can go from 0 (no compression) to 100 (maximum compression). Data can be read back into R with dat = read.fst("file.fst"). Based on the timings listed at the package website, it's faster than feather, data.table, and base R readRDS and saveRDS.
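A minimal sketch of that API (the file name is illustrative; newer fst versions prefer the underscored names write_fst()/read_fst()):

library(fst)

dat <- data.frame(x = 1:1e6, y = runif(1e6))

# compress ranges from 0 (no compression) to 100 (maximum compression)
write.fst(dat, "file.fst", compress = 0)
dat2 <- read.fst("file.fst")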

The package development site warns that the fst data format is still evolving and that fst should therefore not yet be used for long-term data storage.

— answered Nov 08 '22 by eipi10