I need to scrape the career pages of multiple companies (with their permission).
Important factors in deciding what to use
My doubts
EDIT
I ended up using Watir-webdriver + Nokogiri so that I could take advantage of ActiveRecord when storing data. Nokogiri is much faster than Watir-webdriver at extracting data.
Scrapy would have been faster, but the speed gain did not outweigh the added complexity of handling different kinds of websites in Scrapy (e.g. AJAX-driven search on some target sites, which I have to go through).
Hopefully this helps someone.
If speed is important, you can use the watir-webdriver gem to drive PhantomJS (a headless browser with JavaScript support). Open any page in PhantomJS, and if watir-webdriver is too slow at getting the data out of it, you can pass the rendered HTML to Nokogiri.
Read more:
You should check out the guide Making AJAX Applications Crawlable, published by Google. It discusses the AJAX crawling scheme, which some websites support.
Look for #! in the URL's hash fragment: it indicates to the crawler that the site supports the AJAX crawling scheme and that the server will return an HTML snapshot of the page when the URL is slightly modified.
Full Specification
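To make the "slightly modified" URL concrete, here is a minimal Ruby sketch of the rewrite the scheme describes: the part after #! moves into an `_escaped_fragment_` query parameter, with a few special characters percent-escaped. The helper names (`escape_fragment`, `escaped_fragment_url`) and the example.com URLs are made up for illustration, and the escaping set follows my reading of the specification, so check it against the spec before relying on it. Also note that Google has since deprecated this scheme, so it only helps on sites that still serve snapshots.

```ruby
# Bytes that are escaped in addition to control characters, space,
# and everything from 0x7F upward: '#', '%', '&', '+'.
SPECIAL_BYTES = '#%&+'.bytes.freeze

# Percent-escape the fragment byte by byte (hypothetical helper).
def escape_fragment(fragment)
  fragment.bytes.map do |byte|
    needs_escape = byte <= 0x20 || byte >= 0x7F || SPECIAL_BYTES.include?(byte)
    needs_escape ? format('%%%02X', byte) : byte.chr
  end.join
end

# Turn a "pretty" #! URL into the URL a crawler would request for the
# HTML snapshot. Returns nil when the URL has no #! fragment.
def escaped_fragment_url(url)
  base, fragment = url.split('#!', 2)
  return nil if fragment.nil?

  # Append to an existing query string if there is one.
  separator = base.include?('?') ? '&' : '?'
  "#{base}#{separator}_escaped_fragment_=#{escape_fragment(fragment)}"
end

puts escaped_fragment_url('http://example.com/jobs#!page=2')
```

If the site supports the scheme, fetching the resulting URL with a plain HTTP client should return static HTML that can go straight to Nokogiri, with no browser involved.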