
How do I scrape something after JS has changed the DOM?

I'm using Mechanize, although I'm open to Nokogiri if Mechanize can't do it.

I'd like to scrape the page after all the scripts have loaded as opposed to beforehand.

How might I do this?

Tallboy asked May 15 '12 21:05



2 Answers

I think a good option is to combine Nokogiri, Watir, and PhantomJS, something like this:

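require 'watir'
require 'nokogiri'

b = Watir::Browser.new(:phantomjs)  # drive a headless PhantomJS instance
b.goto URL                          # URL stands for the address of the page to scrape
doc = Nokogiri::HTML(b.html)        # parse the HTML of the DOM as the browser currently sees it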

The resulting doc reflects the page after its scripts have run. PhantomJS is nice because it is headless, so no browser window has to be opened.
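One caveat: scripts that fetch data asynchronously may still be running when goto returns, so the HTML can be read too early. Below is a minimal sketch of a polling helper, assuming you know some marker that only shows up once the scripts are done. The name wait_for and its parameters are hypothetical and not part of Watir, which also ships its own wait methods.

```ruby
# Calls the block repeatedly until it returns a truthy value, then returns
# that value. Raises a RuntimeError if the timeout (in seconds) elapses first.
# wait_for is a hypothetical helper written for this sketch.
def wait_for(timeout: 10, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "timed out after #{timeout}s"
    end
    sleep interval
  end
end
```

With the browser object b from above, this could be used as `html = wait_for { h = b.html; h if h.include?('results') }`, where 'results' is a placeholder for whatever text the scripts insert on your page.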

Matt McNaughton answered Sep 30 '22 16:09



Nokogiri and Mechanize are not full web browsers and do not run JavaScript in a browser-model DOM. You want something like Watir or Selenium, which let you control an actual web browser from Ruby.
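For the Selenium route, a rough sketch might look like the following. It assumes the selenium-webdriver gem and a locally installed browser. The helper name rendered_html is made up for this example, and it accepts any object that responds to get, page_source, and quit, as a selenium-webdriver driver does.

```ruby
# Loads url in the browser controlled by `driver`, returns the page source
# the browser reports at that point, and closes the browser afterwards.
# rendered_html is a hypothetical helper, not part of selenium-webdriver.
def rendered_html(driver, url)
  driver.get(url)
  driver.page_source
ensure
  driver.quit
end

# Assumed usage with the selenium-webdriver gem and Firefox installed:
#
#   require 'selenium-webdriver'
#   html = rendered_html(Selenium::WebDriver.for(:firefox), 'https://example.com')
```

The returned string can then be handed to Nokogiri::HTML for parsing, as in the other answer. The ensure clause quits the browser even if loading the page raises.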

Phrogz answered Sep 30 '22 18:09
