I have 15 million CSV files, each with two columns (integer and float), and between 5 and 500 rows. Each file looks something like:
3453,0.034
31,0.031
567,0.456
...
Currently, I am iterating over all the files, and using read.csv()
to import each file into a big list. Here's a simplified version:
allFileNames = Sys.glob(sprintf("%s/*/*/results/*/*", dir))

s$scores = list()

for (i in 1:length(allFileNames)){
    if ((i %% 1000) == 0){
        cat(sprintf("%d of %d\n", i, length(allFileNames)))
    }

    fileName = allFileNames[i]
    approachID = getApproachID(fileName)
    bugID = getBugID(fileName)

    size = file.info(fileName)$size
    if (!is.na(size) && size > 0){ # make sure file exists and is not empty
        tmp = read.csv(fileName, header=F, colClasses=c("integer", "numeric"))
        colnames(tmp) = c("fileCode", "score")
        s$scores[[approachID]][[bugID]] = tmp
    } else {
        # File does not exist, or is empty.
        s$scores[[approachID]][[bugID]] = matrix(-1, ncol=2, nrow=1)
    }
}
Later in my code, I go back through each matrix in the list, and calculate some metrics.
After starting this import process, it looks like it will take on the order of 3 to 5 days to complete. Is there a faster way to do this?
EDIT: I added more details about my code.
Using the readr package: you can consider this as a third option for loading multiple CSV files into R. This method uses the read_csv() function from the readr package; readr is a third-party package, so you first need to install it with install.packages('readr'). For files beyond 100 MB in size, fread() and read_csv() can be expected to be around 5 times faster than read.csv(). vroom is a bit slower than fread() for pure numeric data (the data.table implementation is very fast), but because vroom is multi-threaded it is a bit quicker than readr and read.csv(). Also keep in mind that R objects live entirely in memory: even on 64-bit systems you cannot index objects with huge numbers of rows and columns (the roughly 2-billion vector index limit), so in practice you hit a file size limit around 2-4 GB.
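As a rough illustration (not from the question, which only uses read.csv()), reading the files with data.table::fread() and stacking them could look something like the sketch below. It assumes the same allFileNames vector and the fileCode/score column names used in the question:

    library(data.table)

    # Sketch: read each small file with fread() and stack everything into one
    # data.table, keeping the index of the source file in an "file" id column.
    readOne = function(f) {
        fread(f, header = FALSE, col.names = c("fileCode", "score"),
              colClasses = c("integer", "numeric"))
    }
    allScores = rbindlist(lapply(allFileNames, readOne), idcol = "file")

With millions of tiny files the per-call overhead may still dominate, so this mainly illustrates the API rather than guaranteeing a speed-up.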
I'm not clear on your goal, but if you're trying to read all of these files into a single R data structure, then I see two major performance concerns:

1. Per-file access and parsing overhead: every read.csv() call has a roughly constant cost, which adds up over millions of files.
2. Growing your combined data structure with each new file you read, which gets progressively more expensive as the structure grows.

So do some quick profiling and see how long the reads are taking. If they're slowing down progressively as you read in more files, then let's focus on problem #2. If they're consistently slow, then let's worry about problem #1.

Regarding solutions, I'd say you could start with two things (see the sketch below):

1. Replace read.csv() with the lower-level scan() or readLines() functions, which do less work per file.
2. Preallocate your combined data structure so it doesn't have to be regrown on every iteration.

Add some more details about your code (what does the list you're using look like?) and we may be able to be more helpful.
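To make the two suggestions concrete, here is a minimal sketch (mine, not part of the answer) that preallocates a flat, file-indexed list and fills it with scan(). It keeps the -1 sentinel from the question but drops the nested approachID/bugID structure just to stay short; mapping back to those IDs is left out:

    # Preallocate a list of the right length instead of growing a nested list.
    scores = vector("list", length(allFileNames))

    for (i in seq_along(allFileNames)) {
        fileName = allFileNames[i]
        size = file.info(fileName)$size
        if (!is.na(size) && size > 0) {
            # scan() returns a two-element list: integer codes and numeric scores.
            tmp = scan(fileName, what = list(fileCode = 0L, score = 0.0),
                       sep = ",", quiet = TRUE)
            scores[[i]] = matrix(c(tmp$fileCode, tmp$score), ncol = 2)
        } else {
            scores[[i]] = matrix(-1, ncol = 2, nrow = 1)   # sentinel, as in the question
        }
    }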
Using scan (as Joshua states in a comment) could be faster (3-4 times):

    scan(fileName, what=list(0L,0.0), sep=",", dec=".", quiet=TRUE)

The main difference is that scan returns a list with two elements, while read.csv returns a data.frame.
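For example (a small illustration, not from the original answer), the scan() result can be converted into the same shape the read.csv() call in the question produces:

    tmp = scan(fileName, what = list(0L, 0.0), sep = ",", dec = ".", quiet = TRUE)
    # scan() gives list(integer vector, numeric vector); name the columns to
    # match the read.csv() version in the question.
    tmp = data.frame(fileCode = tmp[[1]], score = tmp[[2]])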