 

Fastest way to import millions of files in R?

Tags: import, r, csv

I have 15 million CSV files, each with two columns (integer and float), and between 5 and 500 rows. Each file looks something like:

3453,0.034
31,0.031
567,0.456
...

Currently, I am iterating over all the files, and using read.csv() to import each file into a big list. Here's a simplified version:

allFileNames = Sys.glob(sprintf("%s/*/*/results/*/*", dir))

s = list()
s$scores = list()

for (i in 1:length(allFileNames)){
    if ((i %% 1000) == 0){
        cat(sprintf("%d of %d\n", i, length(allFileNames)))
    }

    fileName = allFileNames[i]
    approachID = getApproachID(fileName)
    bugID = getBugID(fileName)

    size = file.info(fileName)$size
    if (!is.na(size) && size > 0){ # make sure file exists and is not empty
        tmp = read.csv(fileName, header=F, colClasses=c("integer", "numeric"))
        colnames(tmp) = c("fileCode", "score")
        s$scores[[approachID]][[bugID]] = tmp
    } else {
        # File does not exist, or is empty.
        s$scores[[approachID]][[bugID]] = matrix(-1, ncol=2, nrow=1)
    }
}


Later in my code, I go back through each matrix in the list, and calculate some metrics.

After starting this import process, it looks like it will take on the order of 3 to 5 days to complete. Is there a faster way to do this?

EDIT: I added more details about my code.

asked Mar 23 '12 by stepthom



2 Answers

I'm not clear on your goal, but if you're trying to read all of these files into a single R data structure, then I see two major performance concerns:

  1. File access times - from the moment you request read.csv, a myriad of complex processes start on your machine: checking whether that file exists, finding its location in memory or on disk (and reading the data into memory, if need be), then interpreting the data within R. I would expect this to be a nearly constant slowdown as you read in millions of files.
  2. Growing your single data structure with each new file read. Every time you want to add a few rows to your matrix, you'll likely need to reallocate a similarly sized chunk of memory to store the larger matrix. If you're growing your array 15 million times, you'll certainly notice a performance slowdown here, and it will get progressively worse as you read in more files (a short self-contained demonstration follows this list).
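As a quick, self-contained illustration of point 2 (synthetic data, nothing to do with your CSV files): growing a matrix with rbind() copies every row already stored on each append, while filling a pre-allocated matrix writes in place.

n = 10000
chunk = matrix(runif(10), ncol = 2)   # pretend this is one small file's worth of rows

grow = system.time({
    m = matrix(numeric(0), ncol = 2)
    for (i in 1:n) m = rbind(m, chunk)            # copies all existing rows each time
})["elapsed"]

prealloc = system.time({
    m = matrix(NA_real_, nrow = n * nrow(chunk), ncol = 2)
    for (i in 1:n) {
        rows = ((i - 1) * nrow(chunk) + 1):(i * nrow(chunk))
        m[rows, ] = chunk                          # writes in place, no copying
    }
})["elapsed"]

cat(sprintf("growing: %.2f sec, pre-allocated: %.2f sec\n", grow, prealloc))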

So do some quick profiling and see how long the reads are taking. If they're slowing down progressively as you read in more files, then let's focus on problem #2. If it's constantly slow, then let's worry about problem #1.
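One minimal way to do that profiling (a sketch: only the timing lines are new, and the rest of the loop body from your question stays exactly as it is) is to print how long each block of 1000 files took and watch whether that number creeps upward:

blockStart = Sys.time()
for (i in 1:length(allFileNames)){
    if ((i %% 1000) == 0){
        elapsed = as.numeric(difftime(Sys.time(), blockStart, units = "secs"))
        cat(sprintf("%d of %d (last 1000 files: %.1f sec)\n",
                    i, length(allFileNames), elapsed))
        blockStart = Sys.time()
    }
    # ... the rest of the loop body from the question stays the same ...
}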

Regarding solutions, I'd say you could start with two things:

  1. Combine the CSV files in another programming language. A simple shell script would likely do the job for you if you're just looping through files and concatenating them into a single large file. As Joshua and Richie mention below, you may be able to optimize this without having to deviate to another language by using the more efficient scan() or readLines() functions.
  2. Pre-size your unified data structure. If you're using a matrix, for instance, set the number of rows to ~ 15 million x 100. That will ensure that you only have to find room in memory for this object once, and the rest of the operations will just insert data into the pre-sized matrix (see the sketch after this list).
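For point 2, here is a rough sketch of what pre-sizing could look like. This is not your code: it keys rows by a file index instead of approachID/bugID, assumes the ~100 rows-per-file estimate above, and reads with scan() as in the other answer; whether a single matrix of that size fits in your memory is a separate question.

estTotalRows = length(allFileNames) * 100   # rough estimate; enlarge in big chunks if exceeded
scores = matrix(NA_real_, nrow = estTotalRows, ncol = 3,
                dimnames = list(NULL, c("fileIndex", "fileCode", "score")))

nextRow = 1
for (i in 1:length(allFileNames)){
    size = file.info(allFileNames[i])$size
    if (is.na(size) || size == 0) next       # skip missing or empty files

    tmp = scan(allFileNames[i], what = list(0L, 0.0), sep = ",", quiet = TRUE)
    n = length(tmp[[1]])
    scores[nextRow:(nextRow + n - 1), ] = cbind(i, tmp[[1]], tmp[[2]])
    nextRow = nextRow + n
}
scores = scores[seq_len(nextRow - 1), , drop = FALSE]   # trim unused pre-allocated rows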

Add some more details of your code (what does the list look like that you're using?) and we may be able to be more helpful.

answered Oct 10 '22 by Jeff Allen


Using scan (as Joshua states in a comment) could be 3-4 times faster:

scan(fileName, what=list(0L,0.0), sep=",", dec=".", quiet=TRUE)

The main difference is that scan returns a list with two elements, while read.csv returns a data.frame.
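Inside the loop from the question, it could be used roughly like this (a sketch; fileName, approachID, bugID and s come from that loop):

tmp = scan(fileName, what=list(0L, 0.0), sep=",", dec=".", quiet=TRUE)
names(tmp) = c("fileCode", "score")
s$scores[[approachID]][[bugID]] = as.data.frame(tmp)   # or keep it as a two-element list

Keeping the result as a plain list (or binding it into a matrix with cbind(tmp[[1]], tmp[[2]])) avoids the data.frame conversion overhead entirely, at the cost of changing what the rest of your code expects.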

answered Oct 10 '22 by Marek