I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output; they contain data that enriches what we know about the millions of lines in the main CSV file. The goal is to output a new CSV file that has each record merged with the additional information from the other files.
Imagine that the large CSV file holds transactions, but the customer information and billing information are recorded in two other files, and we want to output a new CSV that has each transaction linked to the customer ID, account ID, etc.
A colleague has a working Java program that does this, but it is very slow. Apparently the CSV file with the millions of lines has to be walked through many, many, many times.
My question is (yes, I am getting to it): how should I approach this in Ruby? The goal is for it to be faster; the current run takes 18+ hours with very little CPU activity.
Can I load this many records into memory? If so, how should I do it?
I know this is a little vague. I'm just looking for ideas, as this is new to me.
Here is some Ruby code I wrote to process large CSV files (~180 MB in my case).
https://gist.github.com/1323865
A standard FasterCSV.parse pulling it all into memory was taking over an hour. This got it down to about 10 minutes.
The relevant part is this:
require 'fastercsv'

lines = []
IO.foreach('/tmp/zendesk_tickets.csv') do |line|
  lines << line
  if lines.size >= 1000
    # Parse the buffered raw lines as CSV; if the batch ends in the middle of a
    # quoted field, parsing fails and we keep accumulating lines and retry.
    lines = FasterCSV.parse(lines.join) rescue next
    store lines   # 'store' is the bulk-insert helper from the gist
    lines = []
  end
end
store lines       # flush whatever is left in the final partial batch
IO.foreach doesn't load the entire file into memory; it just steps through it with a buffer. When it gets to 1000 lines, it tries parsing them as CSV and inserting just those rows. One tricky part is the "rescue next". If your CSV has fields that span multiple lines, you may need to grab a few more lines to get a valid, parseable CSV string; otherwise the line you're on could be in the middle of a field.
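For what it's worth, on newer Rubies (1.9+) FasterCSV became the standard library's CSV class, and CSV.foreach streams the file row by row while handling quoted fields that span multiple lines, so you can batch parsed rows instead of raw lines. A minimal sketch, assuming a 'store' helper like the one above:

require 'csv'  # FasterCSV was merged into the stdlib CSV class in Ruby 1.9

batch = []
CSV.foreach('/tmp/zendesk_tickets.csv') do |row|
  # Each row arrives already parsed, including fields with embedded newlines.
  batch << row
  if batch.size >= 1000
    store batch   # 'store' stands in for the same bulk-insert helper as above
    batch = []
  end
end
store batch unless batch.empty?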
In the gist you can see one other nice optimization, which uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This allows you to insert in bulk, and if a duplicate key is detected it simply overwrites the values in that row instead of inserting a new row. You can think of it like a create/update in one query. You'll need to set a unique index on at least one column for this to work.
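Not from the gist itself, but roughly what such a bulk upsert looks like from Ruby; the mysql2 gem, the 'tickets' table, and its columns here are illustrative assumptions:

require 'mysql2'  # assumed client library; the gist may use a different one

client = Mysql2::Client.new(host: 'localhost', username: 'user',
                            password: 'secret', database: 'zendesk')

# rows is an array of [id, subject, status] arrays (hypothetical columns).
def store_rows(client, rows)
  values = rows.map do |id, subject, status|
    "(#{id.to_i}, '#{client.escape(subject.to_s)}', '#{client.escape(status.to_s)}')"
  end.join(', ')

  # One multi-row INSERT; rows whose unique key already exists are updated
  # in place instead of raising a duplicate-key error.
  client.query(<<-SQL)
    INSERT INTO tickets (id, subject, status)
    VALUES #{values}
    ON DUPLICATE KEY UPDATE subject = VALUES(subject), status = VALUES(status)
  SQL
end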
How about using a database?
Jam the records into tables, and then query them out using joins.
The import might take a while, but the DB engine will be optimized for the join and retrieval part... (see the sketch below)
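A minimal sketch of that idea using SQLite via the sqlite3 gem; the table names, column layouts, and file names are assumptions for illustration, not anything from the question:

require 'csv'
require 'sqlite3'  # assumed; any relational database with JOIN support works

db = SQLite3::Database.new('merge.db')
db.execute('CREATE TABLE IF NOT EXISTS transactions (id TEXT, customer_id TEXT, amount TEXT)')
db.execute('CREATE TABLE IF NOT EXISTS customers (customer_id TEXT PRIMARY KEY, name TEXT, account_id TEXT)')

# Load the small lookup file once (hypothetical layout: customer_id,name,account_id).
db.transaction do
  CSV.foreach('customers.csv') do |row|
    db.execute('INSERT OR REPLACE INTO customers VALUES (?, ?, ?)', row)
  end
end

# Stream the big transactions file in, wrapped in a transaction for speed.
db.transaction do
  CSV.foreach('transactions.csv') do |row|
    db.execute('INSERT INTO transactions VALUES (?, ?, ?)', row)
  end
end

# Let the database do the join, then write the merged CSV back out.
CSV.open('merged.csv', 'w') do |out|
  db.execute(<<-SQL) { |joined| out << joined }
    SELECT t.id, t.amount, c.customer_id, c.account_id, c.name
    FROM transactions t
    JOIN customers c ON c.customer_id = t.customer_id
  SQL
end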