I have large CSV datasets (10M+ lines) that need to be processed. I have two other files that need to be referenced for the output; they contain data that enriches what we know about the millions of lines in the main CSV file. The goal is to output a new CSV file that has each record merged with the additional information from the other files.
Imagine that the large CSV file holds transactions, but the customer information and billing information are recorded in two other files, and we want to output a new CSV that has each transaction linked to the customer ID, account ID, etc.
A colleague has a working Java program that does this, but it is very slow. Apparently the CSV file with the millions of lines has to be walked through many, many, many times.
My question is (yes, I am getting to it): how should I approach this in Ruby? The goal is for it to be faster; the current run takes 18+ hours with very little CPU activity.
Can I load this many records into memory? If so, how should I do it?
I know this is a little vague. I'm just looking for ideas, as this is new to me.
Here is some Ruby code I wrote to process large CSV files (~180 MB in my case).
https://gist.github.com/1323865
A standard FasterCSV.parse pulling it all into memory was taking over an hour. This got it down to about 10 minutes.
The relevant part is this:
require 'fastercsv'

lines = []
IO.foreach('/tmp/zendesk_tickets.csv') do |line|
  lines << line
  if lines.size >= 1000
    # Parse the buffered raw lines as CSV; if the batch ends in the middle of a
    # quoted field, parsing fails and we keep accumulating lines and retry.
    lines = FasterCSV.parse(lines.join) rescue next
    store lines   # 'store' is the bulk-insert helper from the gist
    lines = []
  end
end
store lines       # flush whatever is left in the final partial batch
IO.foreach doesn't load the entire file into memory; it just steps through it with a buffer. When it gets to 1000 lines, it tries parsing them as CSV and inserting just those rows. One tricky part is the "rescue next". If your CSV has fields that span multiple lines, you may need to grab a few more lines to get a valid, parseable CSV string; otherwise the line you're on could be in the middle of a field.
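For what it's worth, on newer Rubies (1.9+) FasterCSV became the standard library's CSV class, and CSV.foreach streams the file row by row while handling quoted fields that span multiple lines, so you can batch parsed rows instead of raw lines. A minimal sketch, assuming a 'store' helper like the one above:

require 'csv'  # FasterCSV was merged into the stdlib CSV class in Ruby 1.9

batch = []
CSV.foreach('/tmp/zendesk_tickets.csv') do |row|
  # Each row arrives already parsed, including fields with embedded newlines.
  batch << row
  if batch.size >= 1000
    store batch   # 'store' stands in for the same bulk-insert helper as above
    batch = []
  end
end
store batch unless batch.empty?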
In the gist you can see one other nice optimization, which uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This allows you to insert in bulk, and if a duplicate key is detected it simply overwrites the values in that row instead of inserting a new row. You can think of it like a create/update in one query. You'll need to set a unique index on at least one column for this to work.
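Not from the gist itself, but roughly what such a bulk upsert looks like from Ruby; the mysql2 gem, the 'tickets' table, and its columns here are illustrative assumptions:

require 'mysql2'  # assumed client library; the gist may use a different one

client = Mysql2::Client.new(host: 'localhost', username: 'user',
                            password: 'secret', database: 'zendesk')

# rows is an array of [id, subject, status] arrays (hypothetical columns).
def store_rows(client, rows)
  values = rows.map do |id, subject, status|
    "(#{id.to_i}, '#{client.escape(subject.to_s)}', '#{client.escape(status.to_s)}')"
  end.join(', ')

  # One multi-row INSERT; rows whose unique key already exists are updated
  # in place instead of raising a duplicate-key error.
  client.query(<<-SQL)
    INSERT INTO tickets (id, subject, status)
    VALUES #{values}
    ON DUPLICATE KEY UPDATE subject = VALUES(subject), status = VALUES(status)
  SQL
end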
How about using a database?
Jam the records into tables, and then query them out using joins.
The import might take a while, but the DB engine will be optimized for the join and retrieval part... (see the sketch below)
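A minimal sketch of that idea using SQLite via the sqlite3 gem; the table names, column layouts, and file names are assumptions for illustration, not anything from the question:

require 'csv'
require 'sqlite3'  # assumed; any relational database with JOIN support works

db = SQLite3::Database.new('merge.db')
db.execute('CREATE TABLE IF NOT EXISTS transactions (id TEXT, customer_id TEXT, amount TEXT)')
db.execute('CREATE TABLE IF NOT EXISTS customers (customer_id TEXT PRIMARY KEY, name TEXT, account_id TEXT)')

# Load the small lookup file once (hypothetical layout: customer_id,name,account_id).
db.transaction do
  CSV.foreach('customers.csv') do |row|
    db.execute('INSERT OR REPLACE INTO customers VALUES (?, ?, ?)', row)
  end
end

# Stream the big transactions file in, wrapped in a transaction for speed.
db.transaction do
  CSV.foreach('transactions.csv') do |row|
    db.execute('INSERT INTO transactions VALUES (?, ?, ?)', row)
  end
end

# Let the database do the join, then write the merged CSV back out.
CSV.open('merged.csv', 'w') do |out|
  db.execute(<<-SQL) { |joined| out << joined }
    SELECT t.id, t.amount, c.customer_id, c.account_id, c.name
    FROM transactions t
    JOIN customers c ON c.customer_id = t.customer_id
  SQL
end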