So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's county level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it. So my question is this, given that I want all the counties, but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R? Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R? Any ideas would be greatly appreciated. Update: <ul> <li>Constraints</li> <ul> <li>Needs to use my machine, so no EC2 instances</li> <li>As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...</li> <li>As you can see below, the data contains mixed types, which I need to operate on later</li> </ul> <li>Data</li> <ul> <li>The data is 3.5GB, with about 8.5 million rows and 17 columns</li> <li>A couple thousand rows (~2k) are malformed, with only one column instead of 17</li> <ul><li>These are entirely unimportant and can be dropped</li></ul> <li>I only need ~100,000 rows out of this file (See below)</li> </ul> </ul> Data example: <pre class="prettyprint"><code>County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ... Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ... Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ... NC [Malformed row] [8.5 Mill rows] </code></pre> I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R: <pre class="prettyprint"><code>County; State; Year; Quarter; Segment; GDP; ... Ada County;NC;2009;4;FIRE;80.1; ... Ada County;NC;2010;1;FIRE;82.5; ... [~200,000 rows] </code></pre> Results: After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation. I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only columns I want. It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket. <pre class="prettyprint"><code>## Open a connection separately to hold the cursor position file.in <- file('bad_data.txt', 'rt') file.out <- file('chopped_data.txt', 'wt') line <- readLines(file.in, n=1) line.split <- strsplit(line, ';') # Stitching together only the columns we want cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE) ## Use a loop to read in the rest of the lines line <- readLines(file.in, n=1) while (length(line)) { line.split <- strsplit(line, ';') if (length(line.split[[1]]) > 1) { if (line.split[[1]][3] == '2009') { cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE) } } line<- readLines(file.in, n=1) } close(file.in) close(file.out) </code></pre> Failings by Approach: <ul> <li>sqldf</li> <ul><li>This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.</li></ul> <li>MapReduce</li> <ul><li>To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.</li></ul> <li>bigmemory</li> <ul><li>This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large data sets for the future though, I'd consider only using numbers just to keep this option alive.</li></ul> <li>scan</li> <ul><li>Scan seemed to have similar type issues as big memory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.</li></ul> </ul>

I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this, including: <ol> <li> mapReduce (pure R)</li> <li> RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files</li> </ol> Alternatively, R provides several packages to deal with large data that go outside memory (onto disk). You could probably load the whole dataset into a <code>bigmemory</code> object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.

Trimming a huge (3.5 GB) csv file to read into R

Tags:

r

csv

So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's county level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.

So my question is this, given that I want all the counties, but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?

Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?

Any ideas would be greatly appreciated.

Update:

Constraints

Needs to use my machine, so no EC2 instances
As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
As you can see below, the data contains mixed types, which I need to operate on later

Data

The data is 3.5GB, with about 8.5 million rows and 17 columns
A couple thousand rows (~2k) are malformed, with only one column instead of 17

These are entirely unimportant and can be dropped

I only need ~100,000 rows out of this file (See below)

Data example:

County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ... Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ... Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ... NC  [Malformed row] [8.5 Mill rows]

I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:

County; State; Year; Quarter; Segment; GDP; ... Ada County;NC;2009;4;FIRE;80.1; ... Ada County;NC;2010;1;FIRE;82.5; ... [~200,000 rows]

Results:

After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.

I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only columns I want.

It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.

## Open a connection separately to hold the cursor position file.in <- file('bad_data.txt', 'rt') file.out <- file('chopped_data.txt', 'wt') line <- readLines(file.in, n=1) line.split <- strsplit(line, ';') # Stitching together only the columns we want cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE) ## Use a loop to read in the rest of the lines line <- readLines(file.in, n=1) while (length(line)) {   line.split <- strsplit(line, ';')   if (length(line.split[[1]]) > 1) {     if (line.split[[1]][3] == '2009') {         cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)     }   }   line<- readLines(file.in, n=1) } close(file.in) close(file.out)

Failings by Approach:

sqldf

This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.

MapReduce

To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.

bigmemory

This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large data sets for the future though, I'd consider only using numbers just to keep this option alive.

scan

Scan seemed to have similar type issues as big memory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.

522

asked Jun 22 '10 16:06

FTWynn

2 Answers

My try with readLines. This piece of a code creates csv with selected years.

file_in <- file("in.csv","r") file_out <- file("out.csv","a") x <- readLines(file_in, n=1) writeLines(x, file_out) # copy headers  B <- 300000 # depends how large is one pack while(length(x)) {     ind <- grep("^[^;]*;[^;]*; 20(09|10)", x)     if (length(ind)) writeLines(x[ind], file_out)     x <- readLines(file_in, n=B) } close(file_in) close(file_out)

answered Sep 21 '22 18:09

Marek

I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this, including:

mapReduce (pure R)
RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files

Alternatively, R provides several packages to deal with large data that go outside memory (onto disk). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.

answered Sep 17 '22 18:09

Shane

Related questions
                            
                                use %>% with replacement functions like colnames()<-
                            
                                Get the difference between dates in terms of weeks, months, quarters, and years
                            
                                How to plot all the columns of a data frame in R
                            
                                Get all Parameters as List
                            
                                How to overlay density plots in R?
                            
                                Use a value from the previous row in an R data.table calculation
                            
                                How to prevent scientific notation in R? [duplicate]
                            
                                Legend on bottom, two rows wrapped in ggplot2 in r
                            
                                Filter multiple values on a string column in dplyr
                            
                                How do you add a general label to facets in ggplot2?
                            
                                Types and classes of variables
                            
                                How do I deal with special characters like \^$.?*|+()[{ in my regex?
                            
                                What does "The following object is masked from 'package:xxx'" mean?
                            
                                Error in fetch(key) : lazy-load database
                            
                                Usage of `...` (three-dots or dot-dot-dot) in functions [duplicate]
                            
                                ggplot combining two plots from different data.frames
                            
                                Return index of the smallest value in a vector?
                            
                                Create a data.frame where a column is a list
                            
                                Formula with dynamic number of variables
                            
                                How can I interrupt a running code in R with a keyboard command?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With