 

Scraping AJAX-enabled webpages

I need to scrape the career pages of multiple companies (with their permission).

Important factors in deciding what to use

  1. I would be scraping around 2,000 pages daily, so I need a decently fast solution.
  2. Some of these pages populate data via AJAX after the page is loaded.
  3. My web stack is Ruby/Rails with MySQL, etc.
  4. I have written scrapers before using Scrapy (Python), plus Selenium for AJAX-enabled pages.

My doubts

  1. I am unsure whether I should go with Python (i.e. Scrapy + Selenium, which I think is the best alternative in Python) or prefer something in Ruby, since my entire codebase is in Ruby.
  2. Scrapy + Selenium is often slow; are there faster alternatives in Ruby? (This would make the decision easier.) The most popular Ruby alternative with support for AJAX-loaded pages seems to be Watir; can anybody comment on its speed? Also, are there other Ruby alternatives, e.g. Mechanize/Nokogiri plus something else for AJAX-loaded pages?

EDIT

Ended up using watir-webdriver + Nokogiri, so that I can take advantage of ActiveRecord when storing data. Nokogiri is much faster than watir-webdriver at extracting data.

Scrapy would have been faster, but the speed tradeoff wasn't as significant as the complexity tradeoff in handling different kinds of websites with Scrapy (e.g. the AJAX-driven search on some target sites, which I necessarily have to go through).
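A minimal sketch of that combination (the target URL, the wait condition, the CSS selectors, and the Job ActiveRecord model are all hypothetical placeholders):

    require 'watir-webdriver'
    require 'nokogiri'

    # Let a real browser execute the page's JavaScript, then hand the
    # rendered HTML to Nokogiri, which is much faster at extraction.
    browser = Watir::Browser.new
    browser.goto 'http://example.com/careers'
    Watir::Wait.until { browser.div(:class => 'job-listing').exists? }

    doc = Nokogiri::HTML(browser.html)
    doc.css('div.job-listing').each do |listing|
      # Job is a hypothetical ActiveRecord model
      Job.create!(
        :title => listing.at_css('h2').text.strip,
        :url   => listing.at_css('a')['href']
      )
    end

    browser.close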

Hopefully this helps someone.

asked by nik-v


2 Answers

If speed is important, you can use the watir-webdriver gem to drive PhantomJS (a headless browser with JavaScript support). Open any page in PhantomJS, and if watir-webdriver is too slow at getting the data out of it, you can pass the rendered HTML to Nokogiri.
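A sketch of that handoff, assuming PhantomJS is installed and on the PATH (the URL and selector are hypothetical):

    require 'watir-webdriver'
    require 'nokogiri'

    # Drive headless PhantomJS instead of a visible browser.
    browser = Watir::Browser.new :phantomjs
    browser.goto 'http://example.com/ajax-page'
    Watir::Wait.until { browser.div(:id => 'results').exists? } # wait for AJAX content

    # Pass the fully rendered HTML to Nokogiri for fast extraction.
    doc = Nokogiri::HTML(browser.html)
    doc.css('#results li').each { |item| puts item.text }

    browser.close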

Read more:

  • http://jkotests.wordpress.com/2013/08/21/watir-nokogiri-gem-published/
  • http://zeljkofilipin.com/watir-nokogiri/
answered by Željko Filipin


You should check out the guide Making AJAX Applications Crawlable published by Google; it discusses the AJAX crawling scheme, which some websites support.

Look for #! in the URL's hash fragment; it indicates to the crawler that the site supports the AJAX crawling scheme and that the server will return an HTML snapshot of the page when the URL is slightly modified.
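As a sketch, a crawler can rewrite such a URL into the _escaped_fragment_ form the scheme defines (the example URL is made up):

    require 'cgi'

    # Per the AJAX crawling scheme, http://example.com/page#!key=value
    # maps to http://example.com/page?_escaped_fragment_=key%3Dvalue,
    # and the server returns an HTML snapshot for the rewritten URL.
    def escaped_fragment_url(url)
      base, fragment = url.split('#!', 2)
      return url unless fragment # no #!, so the site doesn't opt in
      separator = base.include?('?') ? '&' : '?'
      "#{base}#{separator}_escaped_fragment_=#{CGI.escape(fragment)}"
    end

    puts escaped_fragment_url('http://example.com/jobs#!page=2')
    # => http://example.com/jobs?_escaped_fragment_=page%3D2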

Full Specification

answered by r-sal