 

Getting Dynamically Generated HTML With Nokogiri/Open URI

I'm trying to scrape a site by looking at its HTML in Chrome and grabbing the data using Nokogiri. The problem is that some of the tags are dynamically generated, and they don't appear in the response to an open(url) request made with open-uri. Is there a way to "force" a site to dynamically generate its content so that a tool like open-uri can read it?

user1427661 asked Oct 04 '22 14:10



1 Answer

If reading the page via open-uri doesn't produce the content you need, then chances are good that the client is generating that content with JavaScript. open-uri only fetches the raw response; it never executes scripts.

This may be good news: by inspecting the AJAX requests that the page makes (for example in the Network tab of Chrome's developer tools), you might find a JSON feed of the content you're looking for, which you can then request and parse directly. This would get you your data without having to dig through the HTML. Handy!
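As a rough illustration of the JSON route, here is a minimal sketch using only Ruby's standard library. The feed URL, the "items" key, and the "title" field are all hypothetical; the real endpoint and payload shape have to be read off the requests the page actually makes.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical endpoint: replace with the URL seen in the browser's network tab.
FEED_URL = "https://example.com/api/items.json"

# Download the raw JSON text, raising on any non-2xx response.
def fetch_feed(url)
  response = Net::HTTP.get_response(URI(url))
  raise "Request failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response.body
end

# Pull the "title" of every entry under the (assumed) top-level "items" key.
# Returns an empty array when the key is missing.
def extract_titles(json_text)
  data = JSON.parse(json_text)
  data.fetch("items", []).map { |item| item["title"] }
end

# Usage, once the real endpoint is known:
#   puts extract_titles(fetch_feed(FEED_URL))
```

Some feeds only respond when the request carries the same headers or cookies the browser sent, so the plain GET above may need adjusting for a particular site.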

If that doesn't work for some reason, though, you're going to need to open the page with some kind of browser, let it execute its client-side JavaScript, then dump the resulting DOM to HTML. Something like PhantomJS is an excellent choice for this kind of work.
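The browser route can be sketched from Ruby as well. PhantomJS development has since been suspended, so this sketch swaps in headless Chrome and its --dump-dom flag, which prints the DOM after scripts have run. It assumes a Chrome or Chromium binary is installed and reachable on PATH under the name given in BROWSER; that name is an assumption and varies by system.

```ruby
require "open3"

# Assumption: the browser binary is on PATH under this name
# (it may be "google-chrome", "chromium-browser", etc. on your machine).
BROWSER = "chromium"

# Build the argument list for a headless run that prints the rendered DOM.
def dump_dom_command(url, browser: BROWSER)
  [browser, "--headless", "--dump-dom", url]
end

# Run the browser and return the post-JavaScript HTML as a string.
# Arguments are passed as a list, so the URL never goes through a shell.
def rendered_html(url, browser: BROWSER)
  stdout, status = Open3.capture2(*dump_dom_command(url, browser: browser))
  raise "Browser exited with status #{status.exitstatus}" unless status.success?
  stdout
end

# Usage (needs the nokogiri gem and a local browser; the selector is hypothetical):
#   doc = Nokogiri::HTML(rendered_html("https://example.com/"))
#   doc.css("div.item").each { |node| puts node.text }
```

Content that loads after a delay or on user interaction may not be present in a one-shot dump like this; in that case a driver that can wait for elements is probably a better fit.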

Chris Heald answered Oct 06 '22 04:10
