I have a large CSV file on a server I'd like to download and process in chunks, without reading the whole thing into memory. After a bit of finagling I've come up with this:
require 'open-uri'

open("http://example.com/#{LARGE_CSV_FILE}") do |file|
  # Import the CSV in batches of 50,000 lines rather than all at once.
  file.each_slice(50_000) do |fifty_thousand_lines|
    MyModel.import fifty_thousand_lines.join
  end
end
My understanding is that open-uri's #open will wrap the HTTP GET and return an IO-like enumerable object. #each_slice(n) will pass the block an array of n lines at a time. I then join and process those lines.
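(As a quick illustration of that grouping behaviour, here is a minimal sketch with StringIO standing in for the downloaded file; the slice size of 2 is only for demonstration:)

require 'stringio'

# StringIO stands in for the IO-like object open-uri returns;
# each_slice(2) yields arrays of at most 2 lines per iteration.
io = StringIO.new("a\nb\nc\nd\ne\n")
io.each_slice(2) { |lines| p lines }
# => ["a\n", "b\n"]
#    ["c\n", "d\n"]
#    ["e\n"]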
This imports just fine, and judging by my OS X iStat menu, the memory usage of the ruby process doesn't get out of hand. However, it looks like the whole file was downloaded at once. How can that be without the memory usage exploding?
Does ruby download it to a temporary file and then read it from disk line by line? I would have thought open-uri would instead throttle the HTTP connection and only download more data when its block has finished processing its batch of data.
Is this the right way of downloading and processing a file without loading it all into memory?
Yes, it does download to a tempfile. This is easily observed from the console:
2.0.0-p247 :001 > require 'open-uri'
=> true
2.0.0-p247 :002 > f = open("http://stackoverflow.com/questions/19279715/does-ruby-open-uri-http-streaming-throttle-the-download-or-save-to-a-temp-file")
=> #<Tempfile:/tmp/open-uri20140220-27172-1kcjwk2>
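open-uri buffers small responses in memory and, once the body grows beyond OpenURI::Buffer::StringMax (10 KB by default, if I recall correctly), spills it to a Tempfile, so the whole file is on disk before your block ever runs. If you want to process the response as it arrives instead, a sketch along these lines using Net::HTTP streaming should work; note that the URL and MyModel.import are carried over from the question, and the line-buffering logic is just one assumed way of batching the data:

require 'net/http'
require 'uri'

uri = URI("http://example.com/#{LARGE_CSV_FILE}")

Net::HTTP.start(uri.host, uri.port) do |http|
  buffer = ""
  http.request(Net::HTTP::Get.new(uri)) do |response|
    # read_body with a block yields the body in chunks as it arrives,
    # so only one chunk (plus any partial trailing line) is held in memory.
    response.read_body do |chunk|
      buffer << chunk
      lines = buffer.split("\n", -1)
      buffer = lines.pop || ""   # keep the partial trailing line for the next chunk
      MyModel.import(lines.map { |l| l + "\n" }.join) unless lines.empty?
    end
  end
  MyModel.import(buffer) unless buffer.empty?   # any final line without a newline
end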