I have a 12 GB CSV file. I want to extract only some of its columns and write them to a new CSV that I can then load into R for analysis.
The problem is that I get a memory error when I try to load the entire file at once before writing the new CSV. How can I parse the data row by row and then write the output CSV?
Here is what I have so far:
import pandas

colnames = ['contributor name', 'recipient name', 'recipient party', 'contributor cfscore', 'candidate cfscore', 'amount']

# Raw strings keep the backslash in the Windows path from being treated as an escape;
# write to a different file than the one being read.
DATA = pandas.read_csv(r'pathname\filename.csv', names=colnames)
DATA.to_csv(r'pathname\newfile.csv', columns=colnames, index=False)
One way to process a large file is to read it with pandas.read_csv in chunks of a reasonable size: each chunk is read into memory and processed before the next one is read. The chunksize parameter specifies the size of each chunk as a number of rows.
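For example, here is a minimal sketch of that approach. The input path is the placeholder from the question; the output name, the 100,000-row chunk size, and the assumption that the file has a header row containing these column names are mine.

import pandas

# Columns to keep; adjust usecols (or names=) to match the real file layout.
colnames = ['contributor name', 'recipient name', 'recipient party',
            'contributor cfscore', 'candidate cfscore', 'amount']

# Read 100,000 rows at a time and append each chunk to the output file,
# so the full 12 GB file is never held in memory at once.
chunks = pandas.read_csv(r'pathname\filename.csv', usecols=colnames,
                         chunksize=100000)

with open(r'pathname\output.csv', 'w', newline='') as out:
    for i, chunk in enumerate(chunks):
        # Write the header only with the first chunk.
        chunk.to_csv(out, header=(i == 0), index=False)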
Parsing CSV files in Python is also straightforward without pandas: the built-in csv module provides both reading from and writing to CSV files, one row at a time.
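A comparable sketch using only the standard library, again with placeholder paths and assuming the listed columns appear in the file's header row:

import csv

# Columns to keep; names must match the input file's header row.
wanted = ['contributor name', 'recipient name', 'recipient party',
          'contributor cfscore', 'candidate cfscore', 'amount']

# Stream the file row by row: only one row is held in memory at a time.
with open(r'pathname\filename.csv', newline='') as src, \
     open(r'pathname\output.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        writer.writerow(row)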
In R, you can use the fread function from the popular data.table package. You can use the drop= argument to specify columns not to be read -- no memory is allocated for them, and they are not read at all. Or use select= to list the columns you want to keep, if that is more convenient. fread can read CSV files very, very quickly.
If you're dealing with this much data, you'll probably want to familiarize yourself with the data.table package anyway.
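For example, a rough sketch of the select= approach in R (the path and the particular subset of columns are placeholders, not part of the original answer):

library(data.table)

# Read only the columns of interest; the remaining columns are never loaded.
DT <- fread("pathname/filename.csv",
            select = c("contributor name", "recipient name", "amount"))

# Write the reduced data back out for later analysis.
fwrite(DT, "pathname/subset.csv")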
Alternatively, ?read.csv.sql from the sqldf package says it will
Read a file into R filtering it with an sql statement. Only the filtered portion is processed by R so that files larger than R can otherwise handle can be accommodated.
Here's the example:
library(sqldf)

# Write out a sample file, then read it back while filtering with SQL.
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa' ")