 

base R faster than readr for reading multiple CSV files

Tags:

r

tidyverse

readr

There is a lot of documentation on how to read multiple CSVs and bind them into one data frame. I have 5000+ CSV files I need to read in and bind into one data structure.

In particular I've followed the discussion here: Issue in Loading multiple .csv files into single dataframe in R using rbind

The weird thing is that base R is much faster than any other solution I've tried.

Here's what my CSV looks like:

> head(PT)
  Line          Timestamp       Lane.01 Lane.02 Lane.03 Lane.04 Lane.05 Lane.06 Lane.07 Lane.08
1    PL1    05-Jan-16 07:17:36      NA      NA      NA      NA      NA      NA      NA      NA
2    PL1    05-Jan-16 07:22:38      NA      NA      NA      NA      NA      NA      NA      NA
3    PL1    05-Jan-16 07:27:41      NA      NA      NA      NA      NA      NA      NA      NA
4    PL1    05-Jan-16 07:32:43    9.98   10.36   10.41   10.16   10.10    9.97   10.07    9.59
5    PL1    05-Jan-16 07:37:45    9.65    8.87    9.88    9.86    8.85    8.75    9.19    8.51
6    PL1    05-Jan-16 07:42:47    9.14    8.98    9.29    9.04    9.01    9.06    9.12    9.08

I've created three methods for reading in and binding the data. The files are located in a separate directory which I define as:

dataPath <- "data"
PTfiles <- list.files(path=dataPath, full.names = TRUE)
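If the data folder contains anything besides the CSVs, list.files can also filter by name. A small variant (the pattern argument is my addition; the original listing takes everything in the folder):

```r
# Variant: restrict the listing to *.csv files only.
PTfiles <- list.files(path = dataPath, pattern = "\\.csv$", full.names = TRUE)
```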

Method 1: Base R

classes <- c("factor", "character", rep("numeric",8))

# build function to load data
load_data <- function(files, classes) { 
   tables <- lapply(files, read.csv, colClasses = classes, na.strings = c("NA", ""))
   do.call(rbind, tables)
}

#clock
method1 <- system.time(
   PT <- load_data(PTfiles, classes)
)

Method 2: read_csv

In this case I created a wrapper function for read_csv to use:

#create wrapper function for read_csv
read_csv.wrap <- function(x) { read_csv(x, skip = 1, na=c("NA", ""),
                      col_names = c("tool", "timestamp", paste("lane", 1:8, sep="")),
                      col_types = 
                         cols(
                            tool = col_character(),
                            timestamp = col_character(),
                            lane1 = col_double(),
                            lane2 = col_double(),
                            lane3 = col_double(),
                            lane4 = col_double(),
                            lane5 = col_double(),
                            lane6 = col_double(),
                            lane7 = col_double(),
                            lane8 = col_double()
                           )
                     )
}

# Same as method 1, just uses read_csv instead of read.csv

load_data2 <- function(files) { 
   tables <- lapply(files, read_csv.wrap)
   do.call(rbind, tables)
}

#clock
method2 <- system.time(
   PT2 <- load_data2(PTfiles)
)

Method 3: read_csv + dplyr::bind_rows

load_data3 <- function(files) { 
   tables <- lapply(files, read_csv.wrap)
   dplyr::bind_rows(tables)
}

#clock
method3 <- system.time(
   PT3 <- load_data3(PTfiles)
)

What I can't figure out is why the read_csv and dplyr methods are slower in elapsed time when they should be faster. The CPU time is lower, but why would the elapsed time (presumably spent in the file system) increase? What's going on here?
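One way to narrow down where the elapsed time goes (a diagnostic sketch of mine, not something the post runs) is to time the parsing and the binding phases of Method 3 separately:

```r
# Hypothetical diagnostic: split Method 3 into its two phases and time each.
# If the extra elapsed time comes from read_csv itself (e.g. per-file
# startup cost), it will show up in t_read rather than t_bind.
t_read <- system.time(tables <- lapply(PTfiles, read_csv.wrap))
t_bind <- system.time(PT3 <- dplyr::bind_rows(tables))
rbind(t_read, t_bind)
```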

Edit: I added the data.table method as suggested in the comments.

Method 4: data.table

library(data.table)

load_data4 <- function(files) {
   tables <- lapply(files, fread)
   rbindlist(tables)
}

method4 <- system.time(
   PT4 <- load_data4(PTfiles)
)

The data.table method is the fastest from a CPU standpoint. But the question still stands on what is going on with the read_csv methods that makes them so slow.
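One variable the post leaves open is column-type guessing: Method 1 passes colClasses to read.csv, while Method 4 lets fread guess. As a sketch (my assumption, not something the post benchmarks), fread also accepts colClasses and na.strings, making the comparison more like-for-like:

```r
library(data.table)

# Hypothetical variant of Method 4: give fread explicit column classes so it
# spends no time sampling rows to guess types. The first column is read as
# character here rather than factor.
classes4 <- c("character", "character", rep("numeric", 8))

load_data4b <- function(files) {
   rbindlist(lapply(files, fread, colClasses = classes4,
                    na.strings = c("NA", "")))
}

method4b <- system.time(PT4b <- load_data4b(PTfiles))
```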

> rbind(method1, method2, method3, method4)
        user.self sys.self elapsed
method1      0.56     0.39    1.35
method2      0.42     1.98   13.96
method3      0.36     2.25   14.69
method4      0.34     0.67    1.74
asked May 17 '17 by Lloyd Christmas



1 Answer

I would do this in the terminal (Unix). I would put all the files in the same folder, navigate to that folder in the terminal, then use the following command to create a single CSV file:

cat *.csv > merged_csv_file.csv

One caveat with this method is that the header of each file will show up in the middle of the data. To solve this I would suggest you:

Get just the header from the first file

head -2 file1.csv > merged_csv_file.csv

then skip the first "X" lines from the other files with the following command, where "X" is the number of lines to skip.

tail -n +3 -q file*.csv >> merged_csv_file.csv

-n +3 makes tail print from the 3rd line through the end, -q tells it not to print a header with each file name (see man tail), and >> appends to the file rather than overwriting it as > would.

answered Oct 19 '22 by Marcio Rodrigues