
How to run multiple Nokogiri screen-scrape threads at once

I have a website that uses Nokogiri to extract data from many different websites. This process is run as a background job using the delayed_job gem. However, it takes around 3-4 seconds per page because it has to pause and wait for the other websites to respond. I am currently just running them one after another:

Websites.all.each do |website|
  # screen scrape
end

I would like to execute them in batches rather than one at a time, so that I don't have to wait for a server response from each site in turn (a single response can take up to 20 seconds on occasion).

What would be the best Ruby or Rails way to do this?

Thanks for your help in advance.

Nick Barrett asked Mar 21 '11 13:03


3 Answers

You might want to check out Typhoeus, which lets you make parallel HTTP requests.

I found a short blog post about using it with Nokogiri, but I haven't tried this myself.

Wrapped in a delayed_job job, this should do the trick with little client-side latency.

ybakos answered Nov 05 '22 15:11


You need to use delayed_job. Check out the Railscasts episode linked below.

Keep in mind that most hosts charge for background workers like this.

You can also use the spawn plugin. It gives you less control over how the threads are managed, but it is much easier to set up.

This is literally all you need to do:

  1. rails plugin install https://github.com/tra/spawn.git
  2. Then, in your controller or model, wrap the code you want to run in the background in a spawn block

For example:

 spawn do
    #execute your code here :)
 end 

http://railscasts.com/episodes/171-delayed-job

https://github.com/tra/spawn
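
For comparison, here is a minimal sketch of the same idea using plain Ruby threads from the standard library rather than the spawn plugin. It is not the plugin's API. The method name `scrape_in_batches`, the `batch_size` parameter, and the block that stands in for the fetch-and-parse step are all hypothetical names chosen for illustration.

```ruby
# Hypothetical helper: runs the given block once per URL, batch_size URLs
# at a time, with one thread per URL inside each batch.
# Returns a hash of url => block result, or url => exception if that URL failed.
def scrape_in_batches(urls, batch_size: 5, &block)
  results = {}
  urls.each_slice(batch_size) do |batch|
    threads = batch.map do |url|
      Thread.new do
        begin
          [url, block.call(url)]
        rescue StandardError => e
          # Keep one failing site from aborting the whole batch.
          [url, e]
        end
      end
    end
    # Thread#value waits for the thread to finish and returns its result.
    threads.each do |thread|
      url, result = thread.value
      results[url] = result
    end
  end
  results
end

# Usage (assumes the nokogiri gem and the Websites model from the question,
# with a url attribute):
#   require 'net/http'
#   require 'nokogiri'
#   pages = scrape_in_batches(Websites.all.map(&:url), batch_size: 10) do |url|
#     Nokogiri::HTML(Net::HTTP.get(URI(url)))
#   end
```

Because the threads spend most of their time waiting on the network, a batch should take roughly as long as its slowest site rather than the sum of all of them.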

thenengah answered Nov 05 '22 17:11


I'm using EventMachine to do something similar for a current project. There is a terrific library called em-http-request that lets you make multiple HTTP requests in parallel, and it also provides options for synchronising the responses.

From the em-http-request github docs:

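require 'eventmachine'
require 'em-http-request'

EventMachine.run {
  # Both requests are issued straight away and run concurrently.
  http1 = EventMachine::HttpRequest.new('http://google.com/').get
  http2 = EventMachine::HttpRequest.new('http://yahoo.com/').get

  # Each callback fires once its own response has arrived.
  http1.callback { }
  http2.callback { }
}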

So in your case, you could have

callbacks = []
Websites.all.each do |website|
    callbacks << EventMachine::HttpRequest.new(website.url).get
end

callbacks.each do |http|
    # http.response holds the page body; hand it to Nokogiri inside the callback
    http.callback { }
end

Run your Rails application with the thin web server to get a functioning EventMachine loop:

bundle exec rails server thin

You'll also need the eventmachine and em-http-request gems. Good luck!

Dan Garland answered Nov 05 '22 15:11