
Using Python to parse a 12GB CSV

I have a 12 GB CSV file. I want to extract only some of its columns and write them to a new CSV that I can load into R for analysis.

The problem is that I'm getting a memory error when trying to load the entire list at once before writing the new CSV file. How can I parse the data row by row and then create a CSV output?

Here is what I have so far:

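import pandas

colnames = ['contributor name', 'recipient name', 'recipient party', 'contributor cfscore', 'candidate cfscore', 'amount']

# Raw strings keep the backslashes in the Windows-style paths literal.
# read_csv without chunking loads the whole file at once, which is where the memory error occurs.
DATA = pandas.read_csv(r'pathname\filename.csv', names=colnames)
DATA.to_csv(r'pathname\filename.csv', columns=colnames)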
ModalBro asked May 25 '14 17:05



People also ask

How do I read a 10gb csv file in Python?

Use the chunksize parameter of pandas.read_csv. One way to process a large file is to read it in chunks of a reasonable size, so that each chunk is loaded into memory and processed before the next one is read. The chunksize parameter sets the number of lines in each chunk.
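
As a rough sketch of the chunked approach applied to the question above: the function below copies a subset of columns from one CSV to another, one chunk at a time. The function name, the default chunk size, and the assumption that the input file has a header row containing the wanted column names are all illustrative, not taken from the original post.

```python
import pandas as pd


def extract_columns(src_path, dst_path, columns, chunksize=100_000):
    """Copy only `columns` from src_path to dst_path, one chunk at a time.

    Assumes (hypothetically) that the source file has a header row
    containing every name listed in `columns`.
    """
    first = True
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file; usecols skips unwanted columns.
    for chunk in pd.read_csv(src_path, usecols=columns, chunksize=chunksize):
        chunk.to_csv(
            dst_path,
            mode="w" if first else "a",  # create the file once, then append
            header=first,                # write the header only for the first chunk
            index=False,
            columns=columns,             # keep the requested column order
        )
        first = False
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize and the number of selected columns rather than on the size of the file.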

Can Python parse CSV?

Parsing CSV files in Python is straightforward. Python has a built-in csv module that supports both reading and writing CSV files.
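
For the row-by-row parsing the question asks about, a minimal sketch using only the standard-library csv module might look like this. The function name is hypothetical, and the code assumes the input file has a header row whose names include the wanted columns.

```python
import csv


def extract_columns_stream(src_path, dst_path, columns):
    """Stream src_path row by row, writing only `columns` to dst_path.

    Assumes (hypothetically) a header row in the source file.
    Returns the number of data rows written.
    """
    written = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every field not listed in `columns`
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for row in reader:  # only one row is held in memory at a time
            writer.writerow(row)
            written += 1
    return written
```

This is typically slower than pandas but needs almost no memory, and the output can be loaded into R as usual.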


1 Answer

In R, you can use the fread function from the popular data.table package.

You can use the drop= argument to specify columns not to be read -- no memory is allocated for them, and they are not read at all. Or select= the columns you want to keep, if that is more convenient. fread can read csv files very, very quickly.

If you're dealing with this much data, you'll probably want to familiarize yourself with the data.table package anyway.


Alternatively, ?read.csv.sql from the sqldf package says it will

Read a file into R filtering it with an sql statement. Only the filtered portion is processed by R so that files larger than R can otherwise handle can be accommodated.

Here's the example:

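library(sqldf)  # provides read.csv.sql
# Write a sample CSV, then read back only the rows that match the SQL filter
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa' ")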
GSee answered Sep 21 '22 21:09
