I have a 12 GB CSV file. I want to extract only some of its columns and write them to a new CSV that I can then load into R for analysis.
The problem is that I get a memory error when I try to load the entire file at once before writing the new CSV. How can I parse the data row by row and then write the output CSV?
Here is what I have so far:
import pandas

colnames = ['contributor name', 'recipient name', 'recipient party', 'contributor cfscore', 'candidate cfscore', 'amount']

# Raw strings keep the backslash in the Windows path from being treated as an escape;
# write to a different file than the one being read.
DATA = pandas.read_csv(r'pathname\filename.csv', names=colnames)
DATA.to_csv(r'pathname\newfile.csv', columns=colnames, index=False)
One way to process a large file is to read it with pandas.read_csv in chunks of a reasonable size: each chunk is read into memory and processed before the next one is read. The chunksize parameter specifies the size of each chunk as a number of rows.
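For example, here is a minimal sketch of that approach. The input path is the placeholder from the question; the output name, the 100,000-row chunk size, and the assumption that the file has a header row containing these column names are mine.

import pandas

# Columns to keep; adjust usecols (or names=) to match the real file layout.
colnames = ['contributor name', 'recipient name', 'recipient party',
            'contributor cfscore', 'candidate cfscore', 'amount']

# Read 100,000 rows at a time and append each chunk to the output file,
# so the full 12 GB file is never held in memory at once.
chunks = pandas.read_csv(r'pathname\filename.csv', usecols=colnames,
                         chunksize=100000)

with open(r'pathname\output.csv', 'w', newline='') as out:
    for i, chunk in enumerate(chunks):
        # Write the header only with the first chunk.
        chunk.to_csv(out, header=(i == 0), index=False)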
Parsing CSV files in Python is also straightforward without pandas: the built-in csv module provides both reading from and writing to CSV files, one row at a time.
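A comparable sketch using only the standard library, again with placeholder paths and assuming the listed columns appear in the file's header row:

import csv

# Columns to keep; names must match the input file's header row.
wanted = ['contributor name', 'recipient name', 'recipient party',
          'contributor cfscore', 'candidate cfscore', 'amount']

# Stream the file row by row: only one row is held in memory at a time.
with open(r'pathname\filename.csv', newline='') as src, \
     open(r'pathname\output.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)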
In R, you can use the fread function from the popular data.table package. You can use the drop= argument to specify columns not to be read -- no memory is allocated for them, and they are not read at all. Or use select= to list the columns you want to keep, if that is more convenient. fread can read CSV files very, very quickly.
If you're dealing with this much data, you'll probably want to familiarize yourself with the data.table package anyway.
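For example, a rough sketch of the select= approach in R (the path and the particular subset of columns are placeholders, not part of the original answer):

library(data.table)

# Read only the columns of interest; the remaining columns are never loaded.
DT <- fread("pathname/filename.csv",
            select = c("contributor name", "recipient name", "amount"))

# Write the reduced data back out for later analysis.
fwrite(DT, "pathname/subset.csv")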
Alternatively, ?read.csv.sql from the sqldf package says it will
Read a file into R filtering it with an sql statement. Only the filtered portion is processed by R so that files larger than R can otherwise handle can be accommodated.
Here's the example:
library(sqldf)

# Write out a sample file, then read it back while filtering with SQL.
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa' ")