Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast way to read xlsx files into R

Tags:

It is a follow-up question to this one. What is the fastest way to read .xlsx files into R ?

I use library(xlsx) to read in data from 36 .xlsx files. It works. However, the problem is that this is very time consuming (well over 30 minutes), especially when considering the data in each file is not that large (matrix of size 3*3652 in each file). To this end, is there a better to deal with such problem, please? Is there another quick way to read .xlsx into R? Or can I put the 36 files into a single csv file quickly and then read into R?

Moreover, I just realised that readxl cannot write xlsx. Is there a counterpart of it to deal with writing instead of reading?

"Response to those voted this question down":

This question is about fact instead of the so-called "opinionated answers and spam" because speed is time and time is fact but NOT opinion.

Further update:

Perhaps one can explain to us in plain language why some method works much faster than others. I am certainly confused about this.

like image 527
LaTeXFan Avatar asked Jun 14 '17 07:06

LaTeXFan


2 Answers

Here is a small benchmark test. Results: readxl::read_xlsx on average about twice as fast as openxlsx::read.xlsx across different number of rows (n) and columns (p) using standard settings.

enter image description here

options(scipen=999)  # no scientific number format

nn <- c(1, 10, 100, 1000, 5000, 10000, 20000, 30000)
pp <- c(1, 5, 10, 20, 30, 40, 50)

# create some excel files
l <- list()  # save results
tmp_dir <- tempdir()

for (n in nn) {
  for (p in pp) {
    name <-  
    cat("\n\tn:", n, "p:", p)
    flush.console()
    m <- matrix(rnorm(n*p), n, p)
    file <- paste0(tmp_dir, "/n", n, "_p", p, ".xlsx")

    # write
    write.xlsx(m, file)

    # read
    elapsed <- system.time( x <- openxlsx::read.xlsx(file) )["elapsed"]
    df <- data.frame(fun = "openxlsx::read.xlsx", n = n, p = p, 
                     elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
    l <- append(l, list(df))

    elapsed <- system.time( x <- readxl::read_xlsx(file) )["elapsed"]
    df <- data.frame(fun = "readxl::read_xlsx", n = n, p = p, 
                     elapsed = elapsed, stringsAsFactors = F, row.names = NULL)
    l <- append(l, list(df))

  }
}

# results 
d <- do.call(rbind, l)

library(ggplot2)

ggplot(d, aes(n, elapsed, color= fun)) + 
  geom_line() + geom_point() +  
  facet_wrap( ~ paste("columns:", p)) +
  xlab("Number of rows") +
  ylab("Seconds")
like image 71
Mark Heckmann Avatar answered Sep 25 '22 12:09

Mark Heckmann


To write an excel file, readxl has a counterpart called writexl. As far as what is the best package to read an excel file, I think the benchmark provided above is pretty good.

The only reason I would use xlsx to write a package would be if I were to write many excel sheets in one .xlsx file.

like image 29
Shawn Brar Avatar answered Sep 25 '22 12:09

Shawn Brar