
Strategies for reading in CSV files in pieces?

Tags: r, bigdata

I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix.

Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third?

Thanks to commenters for pointing out that I can potentially read in the whole file using some big memory tricks: Quickly reading very large tables as dataframes in R

I can think of some other workarounds (e.g. open it in a good text editor, lop off 2/3 of the observations, then load it into R), but I'd rather avoid them if possible.

So reading it in pieces still seems like the best way to go for now.

asked Feb 19 '12 by Ari B. Friedman



1 Answer

After reviewing this thread, I noticed that an obvious solution to this problem was not mentioned: use connections!

1) Open a connection to your file

con = file("file.csv", "r")

2) Read in a chunk of rows with read.csv

read.csv(con, nrows = chunk_size, ...)

where chunk_size is the number of rows you want per chunk.

Side note: defining colClasses will greatly speed things up. Make sure to set unwanted columns to "NULL" so they are skipped.

3) Do whatever you need to do with that chunk

4) Repeat until the whole file has been read (a complete sketch of this loop follows the steps).

5) Close the connection

close(con)
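Putting the steps together, here is a minimal sketch of the loop, assuming "file.csv" has a header row; the file name, chunk size, column classes, and filter are all placeholders to adapt to your data:

```r
chunk_size <- 1e5   # rows per chunk; tune this to fit comfortably in RAM
# Hypothetical column classes for a 4-column file; "NULL" drops a column entirely
col_classes <- c("integer", "numeric", "NULL", "character")

con <- file("file.csv", "r")

# Consume the header line once so every chunk can be read with header = FALSE
all_names <- scan(con, what = "", sep = ",", nlines = 1, quiet = TRUE)

results <- list()
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE,
             col.names = all_names, colClasses = col_classes),
    error = function(e) NULL          # read.csv errors once the connection is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  # Do whatever you need to do with this chunk, e.g. keep only some rows
  results[[length(results) + 1]] <- chunk[chunk[[1]] > 0, ]   # hypothetical filter
}

close(con)
kept <- do.call(rbind, results)       # combine the pieces that were kept
```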

The advantage of this approach is the connection itself. If you skip that step, things will likely slow down. By opening a connection manually, you essentially open the data set and do not close it until you call close(). This means that as you loop through the data set you never lose your place. Imagine you have a data set with 1e7 rows and you want to load 1e5 rows at a time. Because the connection stays open, we get the first 1e5 rows by running read.csv(con, nrows = 1e5, ...), then get the second chunk by running read.csv(con, nrows = 1e5, ...) again, and so on.

If we did not use a connection, we would get the first chunk the same way, read.csv("file.csv", nrows = 1e5, ...), but for the next chunk we would need read.csv("file.csv", skip = 1e5, nrows = 1e5, ...). Clearly this is inefficient: read.csv has to scan past the first 1e5 rows all over again to find row 1e5 + 1, even though we just read them.
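To make the contrast concrete, here is a rough sketch of the two patterns, again assuming a hypothetical "file.csv" with a header row:

```r
# With a connection: each read.csv call picks up where the previous one stopped
con <- file("file.csv", "r")
chunk1 <- read.csv(con, nrows = 1e5)                  # data rows 1..1e5
chunk2 <- read.csv(con, nrows = 1e5, header = FALSE,  # data rows 1e5+1..2e5
                   col.names = names(chunk1))
close(con)

# Without a connection: every call re-reads the file from the top
chunk1 <- read.csv("file.csv", nrows = 1e5)
chunk2 <- read.csv("file.csv", skip = 1e5 + 1, nrows = 1e5,  # skips header + first 1e5 data rows
                   header = FALSE, col.names = names(chunk1))
```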

Finally, data.table::fread is great, but you cannot pass it a connection, so this approach does not work with fread.

I hope this helps someone.

UPDATE

People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.
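For what it's worth, readr also provides a chunked reader, read_csv_chunked(), if you prefer to let the package manage the loop for you. A minimal sketch, with a made-up filter and chunk size:

```r
library(readr)

# Process "file.csv" in 1e5-row chunks; the rows each callback returns are row-bound together
kept <- read_csv_chunked(
  "file.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    chunk[chunk[[1]] > 0, ]   # hypothetical: keep only the rows you need
  }),
  chunk_size = 1e5
)
```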

answered Sep 18 '22 by Jacob H