I have multiple .xls (~100MB) files from which I would like to load multiple sheets (from each) into R as a dataframe. I have tried various functions, such as xlsx::xlsx2
and XLConnect::readWorksheetFromFile
, both of which always run for a very long time (>15 mins) and never finish and I have to force-quit RStudio to keep working.
I also tried gdata::read.xls
, which does finish, but it takes more than 3 minutes per one sheet and it cannot extract multiple sheets at once (which would be very helpful to speed up my pipeline) like XLConnect::loadWorkbook
can.
The time it takes these functions to execute (and I am not even sure the first two would ever finish if I let them go longer) is way too long for my pipeline, where I need to work with many files at once. Is there a way to get these to go/finish faster?
In several places, I have seen a recommendation to use the function readxl::read_xls
, which seems to be widely recommended for this task and should be faster per sheet. This one, however, gives me an error:
> # Minimal reproducible example:
> setwd("/Users/USER/Desktop")
> library(readxl)
> data <- read_xls(path="test_file.xls")
Error:
filepath: /Users/USER/Desktop/test_file.xls
libxls error: Unable to open file
I also did some elementary testing to make sure the file exists and is in the correct format:
> # Testing existence & format of the file
> file.exists("test_file.xls")
[1] TRUE
> format_from_ext("test_file.xls")
[1] "xls"
> format_from_signature("test_file.xls")
[1] "xls"
The test_file.xls
used above is available here.
Any advice would be appreciated in terms of making the first functions run faster or the read_xls
run at all - thank you!
It seems that some users are able to open the file above using the readxl::read_xls
function, while others are not, both on Mac and Windows, using the most up to date versions of R
, Rstudio
, and readxl
. The issue has been posted on the readxl GitHub and has not been resolved yet.
I downloaded your dataset and read each excel sheet in this way (for example, for sheets "Overall" and "Area"):
install.packages("readxl")
library(readxl)
library(data.table)
dt_overall <- as.data.table(read_excel("test_file.xls", sheet = "Overall"))
area_sheet <- as.data.table(read_excel("test_file.xls", sheet = "Area"))
Finally, I get dt like this (for example, only part of the dataset for the "Area" sheet):
Just as well, you can use the read_xls
function instead read_excel
.
I checked, it also works correctly and even a little faster, since read_excel
is a wrapper over read_xls
and read_xlsx
functions from readxl
package.
Also, you can use excel_sheets
function from readxl
package to read all sheets of your Excel file.
UPDATE
Benchmarking is done with microbenchmark
package for the following packages/functions: gdata::read.xls
, XLConnect::readWorksheetFromFile
and readxl::read_excel
.
But XLConnect
it's a Java-based solution, so it requires a lot of RAM.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With