 

Stream and unzip large csv file with ruby

I have a problem where I need to download, unzip, and then process line by line a very large CSV file. To give you an idea how large the file is:

  • big_file.zip ~700mb
  • big_file.csv ~23gb

Here are some things I'd like to happen:

  • Don't have to download the whole file before unzipping
  • Don't have to unzip whole file before parsing csv lines
  • Don't use up very much memory/disk while doing all this

I don't know if that's possible or not. Here's what I was thinking:

require 'open-uri'
require 'zip' # the rubyzip gem is required as 'zip'
require 'csv'

open('http://foo.bar/big_file.zip') do |zipped|
  Zip::InputStream.open(zipped) do |unzipped|
    sleep 10 until (entry = unzipped.get_next_entry) && entry.name == 'big_file.csv'
    CSV.foreach(unzipped) do |row|
      # process the row, maybe write out to STDOUT or some file
    end
  end
end

Here are the problems I know about:

  • open-uri reads the whole response and saves it into a Tempfile which is no good with a file this size. I'd probably need to use Net::HTTP directly but I'm not sure how to do that and still get an IO.
  • I don't know how fast the download is going to be or if the Zip::InputStream works the way I've shown it working. Can it unzip some of the file when it's not all there yet?
  • Will the CSV.foreach work with rubyzip's InputStream? Does it behave enough like File that it will be able to parse out the rows? Will it freak out if it wants to read but the buffer is empty?
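For the first problem, here's a rough sketch of what I imagine the Net::HTTP approach might look like (the helper name is made up, and I haven't tried this against a real server): it bridges the chunked response body into a readable IO using IO.pipe and a writer thread.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: stream an HTTP response body into a readable IO
# via IO.pipe, instead of letting open-uri buffer it all in a Tempfile.
def streaming_body_io(url)
  reader, writer = IO.pipe
  thread = Thread.new do
    begin
      uri = URI(url)
      Net::HTTP.start(uri.host, uri.port) do |http|
        http.request_get(uri.path) do |response|
          # read_body yields the body in chunks as they arrive
          response.read_body { |chunk| writer.write(chunk) }
        end
      end
    ensure
      writer.close
    end
  end
  [reader, thread]
end
```

The reader end could then be handed to Zip::InputStream.open; whether rubyzip copes with a non-seekable pipe is exactly the open question above.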

I don't know if any of this is the right approach. Maybe some EventMachine solution would be better (I've never used EventMachine before, but if it works better for something like this, I'm all for it).

asked Apr 29 '14 by ZombieDev

1 Answer

It's been a while since I posted this question, and in case anyone else comes across it I thought it might be worth sharing what I found.

  1. For the number of rows I was dealing with, Ruby's standard-library CSV was too slow. My CSV file was simple enough that I didn't need all the machinery for quoted strings or type coercion anyway. It was much easier to just use IO#gets and split each line on commas.
  2. I was unable to stream the entire thing from HTTP through a Zip::InputStream to some IO containing the CSV data. This is because the zip file format puts the End of Central Directory (EOCD) record at the end of the file, and it's needed in order to extract the entries, so extracting while streaming from HTTP doesn't seem workable.
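Point 1 can be sketched like this (using StringIO to stand in for the real stream):

```ruby
require 'stringio'

# For simple rows with no quoted or escaped fields, IO#gets plus
# String#split is much cheaper than the full CSV parser.
io = StringIO.new("id,name\n1,alice\n2,bob\n")
rows = []
while (line = io.gets)
  rows << line.chomp.split(',')
end
# rows => [["id", "name"], ["1", "alice"], ["2", "bob"]]
```

One caveat: String#split with no limit drops trailing empty fields, so pass -1 as the limit if rows can end with an empty column.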

The solution I ended up going with was to download the file to disk and then use Ruby's IO.popen and the Linux unzip command to stream the uncompressed CSV data out of the zip.

IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
  while (line = io.gets)
    # do stuff to process the CSV line
  end
end

The -p switch on unzip sends the extracted file to stdout. IO.popen then uses a pipe to expose that as an IO object in Ruby. Works pretty nicely. You could feed it to the CSV library too if you want that extra processing; it was just too slow for me.

require 'csv'

IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
  # CSV.new wraps any IO; CSV.foreach expects a file path
  CSV.new(io).each do |row|
    # process the row
  end
end
answered Oct 12 '22 by ZombieDev