The easy answer to this is "buy more RAM" but I am hoping to get a more constructive answer and learn something in the process.
I am running Windows 7 64-bit with 8GB of RAM.
I have several very large .csv.gz files (~450MB uncompressed) with exactly the same header information that I read into R and perform some processing on. Then, I need to combine the processed R objects into a single master object and write back out to .csv on disk.
I do this same operation on multiple sets of files. As an example, I have 5 folders each with 6 csv.gz files in them. I need to end up with 5 master files, one for each folder.
My code looks something like the following:
for (loop through folders) {
    master.file = data.table()
    for (loop through files) {
        filename = list.files( ... )
        file = as.data.table(read.csv(gzfile(filename), stringsAsFactors = FALSE))
        gc()
        ...do some processing to file...
        # append file to the running master.file
        if (nrow(master.file) == 0) {
            master.file = file
        } else {
            master.file = rbindlist(list(master.file, file))
        }
        rm(file, filename)
        gc()
    }
    write.csv(master.file, unique master filename, row.names = FALSE)
    rm(master.file)
    gc()
}
This code does not work. I get the "cannot allocate memory" error before it writes out the final csv. I was watching Resource Monitor while running this code and don't understand why it would need 8GB of RAM to do this processing. The total of all the file sizes is roughly 2.7GB, so I was expecting that the maximum memory R would use is 2.7GB. But the write.csv operation seems to use about the same amount of memory as the data object you are writing, so if you have a 2.7GB object in memory and try to write it out, you would be using roughly 5.4GB of memory.
This apparent doubling, combined with a for loop in which memory doesn't seem to be getting adequately freed up, seems to be the problem.
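To put numbers on that doubling, base R's object.size() reports how much memory an object currently occupies; the data below is a toy stand-in, not the actual files, and the sizes are approximate:

```r
# Build a toy data frame and measure its in-memory footprint.
df <- data.frame(x = runif(1e6), y = runif(1e6))

# object.size() reports the memory the object occupies right now;
# write.csv() will transiently need roughly this much again while
# it converts the columns to character for output.
print(object.size(df), units = "MB")   # two columns of 1e6 doubles, roughly 16 MB

# gc() reports (and triggers) garbage collection; the "used" column
# shows current allocations in Mb, which is what Resource Monitor
# sees growing during a run.
gc()
```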
I suspect that I could use the sqldf package as mentioned here and here, but when I assigned the result of the sqldf statement to an R variable I ended up with the same out-of-memory errors.
R objects live entirely in memory. Even on 64-bit systems it is not possible to index objects with huge numbers of rows and columns (there is a roughly 2 billion vector index limit), and you hit practical file size limits around 2-4 GB.
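The 2 billion figure comes from R's 32-bit integer indexing, which you can confirm directly (note that R 3.0.0 and later support "long" atomic vectors beyond this limit, but many operations and packages still assume integer-sized lengths):

```r
# The largest value a 32-bit signed integer can hold: 2^31 - 1.
print(.Machine$integer.max)  # 2147483647

# Historically no R vector could have more elements than this;
# since R 3.0.0 atomic long vectors may exceed it, but data frame
# row counts and much package code are still bound by it in practice.
stopifnot(.Machine$integer.max == 2^31 - 1)
```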
Update 12/23/2013 - The following solution works all in R without running out of memory
(Thanks @AnandaMahto).
The major caveat with this method is that you must be absolutely sure that the files you are reading in and writing out each time have exactly the same header columns, in exactly the same order (or your R processing code must ensure this), since write.table does not check it for you.
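One cheap way to guard against that caveat is to record the header of the first file and stop loudly if any later file disagrees; check_header() below is a hypothetical helper I am sketching for illustration, not part of the original code:

```r
# Hypothetical helper: stop with an informative error if a newly read
# file's columns differ (in name or order) from the reference header.
check_header <- function(file, reference) {
  if (!identical(names(file), reference)) {
    stop("Header mismatch: expected ", paste(reference, collapse = ", "),
         " but got ", paste(names(file), collapse = ", "))
  }
  invisible(TRUE)
}

# Toy usage: the second data frame's columns are reordered, so it fails.
ref <- c("id", "value")
ok  <- data.frame(id = 1:2, value = c(3.1, 4.2))
bad <- data.frame(value = c(3.1, 4.2), id = 1:2)

check_header(ok, ref)                       # passes silently
result <- tryCatch(check_header(bad, ref),
                   error = function(e) "mismatch caught")
```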
for (loop through folders) {
    for (loop through files) {
        filename = list.files( ... )
        file = as.data.table(read.csv(gzfile(filename), stringsAsFactors = FALSE))
        gc()
        ...do some processing to file...
        # append file to the running master file on disk
        if (first time through inner loop) {
            write.table(file,
                        "masterfile.csv",
                        sep = ",",
                        dec = ".",
                        qmethod = "double",
                        row.names = FALSE)
        } else {
            write.table(file,
                        "masterfile.csv",
                        sep = ",",
                        dec = ".",
                        qmethod = "double",
                        row.names = FALSE,
                        append = TRUE,
                        col.names = FALSE)
        }
        rm(file, filename)
        gc()
    }
    gc()
}
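Filled in with concrete (assumed) file handling, the pattern above looks roughly like this self-contained sketch, which appends two toy data frames to one CSV while writing the header only once:

```r
# Sketch of the append-as-you-go pattern using only base R.
# File name and data are illustrative, not from the original setup.
master <- tempfile(fileext = ".csv")

chunks <- list(
  data.frame(id = 1:2, value = c(1.5, 2.5)),
  data.frame(id = 3:4, value = c(3.5, 4.5))
)

for (i in seq_along(chunks)) {
  write.table(chunks[[i]], master,
              sep = ",", dec = ".", qmethod = "double",
              row.names = FALSE,
              append    = i > 1,   # append after the first chunk
              col.names = i == 1)  # write the header only once
}

combined <- read.csv(master)
nrow(combined)  # 4 data rows under a single header line
```

Because each chunk is written and then discarded, peak memory stays near the size of one chunk rather than the size of the full master file.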
My Initial Solution:
for (loop through folders) {
    for (loop through files) {
        filename = list.files( ... )
        file = as.data.table(read.csv(gzfile(filename), stringsAsFactors = FALSE))
        gc()
        ...do some processing to file...
        # write out the file
        write.csv(file, ...)
        rm(file, filename)
        gc()
    }
    gc()
}
I then downloaded and installed GnuWin32's sed package and used Windows command line tools to append the files as follows:
copy /b *common_pattern*.csv master_file.csv
This appends together all of the individual .csv files whose names have the text pattern "common_pattern" in them, headers and all.
Then I use sed.exe to remove all but the first header line as follows:
"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -i 2,${/header_pattern/d;} master_file.csv
-i tells sed to just overwrite the specified file (in-place).
2,$ tells sed to look at the range from the 2nd row to the last row ($).
{/header_pattern/d;} tells sed to find all lines in that range containing the text "header_pattern" and delete (d) those lines.
In order to make sure this was doing what I wanted it to do, I first printed the lines I was planning to delete.
"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -n 2,${/header_pattern/p;} master_file.csv
Works like a charm, I just wish I could do it all in R.