
Help needed with screen scraping using anemone and nokogiri

My starting page is http://www.example.com/startpage, which has 1220 listings split across paginated results in the usual way, e.g. 20 results per page.

I have working code that parses the first page of results and follows links that contain "example_guide/paris_shops" in their URL. I then use Nokogiri to pull specific data from each of those final pages. This all works well, and the 20 results are written to a file.

However, I can't figure out how to get Anemone to also crawl to the next page of results (http://www.example.com/startpage?page=2), parse that page, then move on to the 3rd page (http://www.example.com/startpage?page=3), and so on.

So my question is: how can I get Anemone to start on a page, parse all the result links on that page (plus the next level down, for the specific data), and then follow the pagination to the next page of results and repeat the process until the last page? Because the pagination links look different from the result links, Anemone does not currently follow them.

At the moment I load the URL for the first page of results, let the crawl finish, and then paste in the URL for the 2nd page of results, and so on. This is very manual and inefficient, especially when there are hundreds of pages to get through.

Any help would be much appreciated.

require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  # Only run this block on shop detail pages (URLs without a query string).
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    # Fetch the detail page again and parse it with Nokogiri.
    doc = Nokogiri::HTML(open(page.url))

    # Each field stays nil when its selector matches nothing.
    name = doc.at_css("#top h2").text unless doc.at_css("#top h2").nil?
    address = doc.at_css(".info tr:nth-child(3) td").text unless doc.at_css(".info tr:nth-child(3) td").nil?
    website = doc.at_css("tr:nth-child(5) a").text unless doc.at_css("tr:nth-child(5) a").nil?

    # Append one tab-separated record per shop.
    open('savedwebdata.txt', 'a') { |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
    }
  end
end
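
For reference, the manual step described above (pasting in each results page URL by hand) can be sketched in plain Ruby. This is only an illustration under assumptions: the helper name `results_page_urls` is hypothetical, and the counts (1220 listings, 20 per page) and the `?page=N` pattern are taken from the figures and URLs quoted in the question.

```ruby
require 'uri'

# Assumed values, taken from the question text.
TOTAL_LISTINGS = 1220
PER_PAGE       = 20
START_URL      = "http://www.example.com/startpage"

# Hypothetical helper: builds the URL of every results page.
# Page 1 is the bare start URL; later pages use ?page=N.
def results_page_urls(start_url, total, per_page)
  page_count = (total.to_f / per_page).ceil
  (1..page_count).map do |n|
    n == 1 ? start_url : "#{start_url}?page=#{n}"
  end
end

urls = results_page_urls(START_URL, TOTAL_LISTINGS, PER_PAGE)
puts urls.length  # number of results pages
puts urls.last    # URL of the final results page
```

As far as I know, `Anemone.crawl` also accepts an array of start URLs, so a list like `urls` could be passed in place of the single start page. That is worth checking against the Anemone version in use.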
ginga asked Oct 01 '10 04:10


1 Answer

Actually, Anemone has the Nokogiri document built in. Calling page.doc returns a Nokogiri document, so there is no need to build a second Nokogiri document yourself.
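
On the pagination part of the question, one possible approach is to decide explicitly which links the crawler should follow. The sketch below is not from the original answer. It uses only the Ruby standard library, and the predicate name `follow_link?` and the three patterns are assumptions based on the URLs quoted in the question.

```ruby
require 'uri'

# Hypothetical patterns, based on the URLs quoted in the question.
LISTING_PATH    = %r{example_guide/paris_shops/}
PAGINATION_PATH = %r{\A/startpage\z}
PAGE_QUERY      = /\Apage=\d+\z/

# Returns true for links worth following:
# shop detail pages and paginated results pages.
def follow_link?(link)
  uri = URI(link.to_s)
  return true if LISTING_PATH.match?(uri.path)
  PAGINATION_PATH.match?(uri.path) && PAGE_QUERY.match?(uri.query.to_s)
end

# Sample links (made up for illustration).
links = [
  "http://www.example.com/startpage?page=2",
  "http://www.example.com/example_guide/paris_shops/some-shop",
  "http://www.example.com/about",
  "http://www.example.com/startpage?sort=name"
]

followed = links.select { |link| follow_link?(link) }
puts followed
```

With Anemone, a predicate like this could probably be plugged into `focus_crawl`, for example `anemone.focus_crawl { |page| page.links.select { |link| follow_link?(link) } }`. The `on_pages_like` block could then read its fields from `page.doc` instead of fetching the page again.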

Davinj answered Oct 31 '22 15:10