 

Scraping AJAX-enabled webpages

I need to scrape the career pages of multiple companies (with their permission).

Important factors in deciding what to use

  1. I would be scraping around 2,000 pages daily, so I need a decently fast solution.
  2. Some of these pages populate data via AJAX after the page is loaded.
  3. My web stack is Ruby/Rails with MySQL, etc.
  4. I have written scrapers before using Scrapy (Python), plus Selenium for AJAX-enabled pages.

My doubts

  1. I am unsure whether I should go with Python (i.e. Scrapy + Selenium, which I think is the best alternative in Python) or prefer something in Ruby, since my entire codebase is in Ruby.
  2. Scrapy + Selenium is often slow; are there faster alternatives in Ruby? (This would make the decision easier.) The most popular Ruby alternative with support for AJAX-loaded pages seems to be Watir; can anybody comment on its speed? Also, are there other Ruby alternatives, e.g. Mechanize/Nokogiri plus something else for AJAX-loaded pages?

EDIT

Ended up using watir-webdriver + Nokogiri, so that I can take advantage of ActiveRecord when storing data. Nokogiri is much faster than watir-webdriver at extracting data.

Scrapy would have been faster, but the speed tradeoff wasn't as significant as the complexity tradeoff in handling different kinds of websites with Scrapy (e.g. the AJAX-driven search on some target sites, which I necessarily have to go through).
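A minimal sketch of that combination (the target URL, the wait condition, the CSS selectors, and the Job ActiveRecord model are all hypothetical placeholders):

    require 'watir-webdriver'
    require 'nokogiri'

    # Let a real browser execute the page's JavaScript, then hand the
    # rendered HTML to Nokogiri, which is much faster at extraction.
    browser = Watir::Browser.new
    browser.goto 'http://example.com/careers'
    Watir::Wait.until { browser.div(:class => 'job-listing').exists? }

    doc = Nokogiri::HTML(browser.html)
    doc.css('div.job-listing').each do |listing|
      # Job is a hypothetical ActiveRecord model
      Job.create!(
        :title => listing.at_css('h2').text.strip,
        :url   => listing.at_css('a')['href']
      )
    end

    browser.close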

Hopefully this helps someone.

asked by nik-v


2 Answers

If speed is important, you can use the watir-webdriver gem to drive PhantomJS (a headless browser with JavaScript support). Open any page in PhantomJS, and if watir-webdriver is too slow at getting the data out of it, you can pass the rendered HTML to Nokogiri.
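A sketch of that handoff, assuming PhantomJS is installed and on the PATH (the URL and selector are hypothetical):

    require 'watir-webdriver'
    require 'nokogiri'

    # Drive headless PhantomJS instead of a visible browser.
    browser = Watir::Browser.new :phantomjs
    browser.goto 'http://example.com/ajax-page'
    Watir::Wait.until { browser.div(:id => 'results').exists? } # wait for AJAX content

    # Pass the fully rendered HTML to Nokogiri for fast extraction.
    doc = Nokogiri::HTML(browser.html)
    doc.css('#results li').each { |item| puts item.text }

    browser.close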

Read more:

  • http://jkotests.wordpress.com/2013/08/21/watir-nokogiri-gem-published/
  • http://zeljkofilipin.com/watir-nokogiri/
answered by Željko Filipin


You should check out the guide Making AJAX Applications Crawlable published by Google; it discusses the AJAX crawling scheme, which some websites support.

Look for #! in the URL's hash fragment; it indicates to the crawler that the site supports the AJAX crawling scheme and that the server will return an HTML snapshot of the page when the URL is slightly modified.
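As a sketch, a crawler can rewrite such a URL into the _escaped_fragment_ form the scheme defines (the example URL is made up):

    require 'cgi'

    # Per the AJAX crawling scheme, http://example.com/page#!key=value
    # maps to http://example.com/page?_escaped_fragment_=key%3Dvalue,
    # and the server returns an HTML snapshot for the rewritten URL.
    def escaped_fragment_url(url)
      base, fragment = url.split('#!', 2)
      return url unless fragment # no #!, so the site doesn't opt in
      separator = base.include?('?') ? '&' : '?'
      "#{base}#{separator}_escaped_fragment_=#{CGI.escape(fragment)}"
    end

    puts escaped_fragment_url('http://example.com/jobs#!page=2')
    # => http://example.com/jobs?_escaped_fragment_=page%3D2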

Full Specification

answered by r-sal