I am trying to download more than 1m pages (URLs ending by a sequence ID). I have implemented kind of multi-purpose download manager with configurable number of download threads and one processing thread. The downloader downloads files in batches:
curl = Curl::Easy.new
batch_urls.each { |url_info|
curl.url = url_info[:url]
curl.perform
file = File.new(url_info[:file], "wb")
file << curl.body_str
file.close
# ... some other stuff
}
I have tried to download 8000 pages sample. When using the code above, I get 1000 in 2 minutes. When I write all URLs into a file and do in shell:
cat list | xargs curl
I gen all 8000 pages in two minutes.
Thing is, I need it to have it in ruby code, because there is other monitoring and processing code.
I have tried:
Why is reused Curl::Easy slower than subsequent command line curl calls and how can I make it faster? Or what I am doing wrong?
I would prefer to fix my download manager code than to make downloading for this case in a different way.
Before this, I was calling command-line wget which I provided with a file with list of URLs. Howerver, not all errors were handled, also it was not possible to specify output file for each URL separately when using URL list.
Now it seems to me that the best way would be to use multiple threads with system call to 'curl' command. But why when I can use directly Curl in Ruby?
Code for the download manager is here, if it might help: Download Manager (I have played with timeouts, from not-setting it to various values, it did not seem help)
Any hints appreciated.
This could be a fitting task for Typhoeus
Something like this (untested):
require 'typhoeus'
def write_file(filename, data)
file = File.new(filename, "wb")
file.write(data)
file.close
# ... some other stuff
end
hydra = Typhoeus::Hydra.new(:max_concurrency => 20)
batch_urls.each do |url_info|
req = Typhoeus::Request.new(url_info[:url])
req.on_complete do |response|
write_file(url_info[:file], response.body)
end
hydra.queue req
end
hydra.run
Come to think of it, you might get a memory problem because of the enormous amout of files. One way to prevent that would be to never store the data in a variable but instead stream it to the file directly. You could use em-http-request for that.
EventMachine.run {
http = EventMachine::HttpRequest.new('http://www.website.com/').get
http.stream { |chunk| print chunk }
# ...
}
So, if you don't set a on_body handler than curb will buffer the download. If you're downloading files you should use an on_body handler. If you want to download multiple files using Ruby Curl, try the Curl::Multi.download interface.
require 'rubygems'
require 'curb'
urls_to_download = [
'http://www.google.com/',
'http://www.yahoo.com/',
'http://www.cnn.com/',
'http://www.espn.com/'
]
path_to_files = [
'google.com.html',
'yahoo.com.html',
'cnn.com.html',
'espn.com.html'
]
Curl::Multi.download(urls_to_download, {:follow_location => true}, {}, path_to_files) {|c,p|}
If you want to just download a single file.
Curl::Easy.download('http://www.yahoo.com/')
Here is a good resource: http://gist.github.com/405779
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With