
Retrieve fully rendered page using Curl, or other means?

Is there a way to retrieve the fully rendered HTML of a page after its JavaScript has run? If I use curl, it simply retrieves the base HTML, but lacks the post-rendering of iframes, JavaScript processing, etc.

What would be the best way to accomplish this?

Joshua Hong asked May 30 '12 06:05



2 Answers

As no-one else has answered (except the comment above, but I'll come to that later), I'll try to help as much as possible.

There is no "simple" answer. PHP can't process JavaScript or navigate the DOM natively, so you need something that can.

Your options, as I see them:

  1. If you are after a screen grab (which is what I'm hoping, as you also want Flash to load), I suggest you use one of the commercial APIs that are out there for doing this. You can find some in this list: http://www.programmableweb.com/apitag/?q=thumbnail, for example http://www.programmableweb.com/api/convertapi-web2image

  2. Otherwise you need to run something yourself that can handle JavaScript and the DOM on, or connected to, your server. For this, you'd need an automated browser that you can run server-side to get the information you need. Work through the list in Bergi's comment above and test for a suitable solution. The main one, Selenium, is great for "unit testing" a known website, but I'm not sure how I'd script it to handle random sites, for example. As you would (presumably) only have one "automated browser" and you don't know how long each page will take to load, you'd need to queue the requests and handle them one at a time. You'd also need to ensure pop-up alert()s are handled, install all the third-party libraries (you say you want Flash?!), and handle redirects, timeouts and potential memory hogs (if running this non-stop, you'll periodically want to kill your browser and restart it to clean out the memory!). You'd also have to handle virus attacks, pop-up windows and requests to close the browser completely.

  3. Thirdly, VB has a web-browser component. I used it for a project a long time ago to do something similar-ish, but on a known site. Whether it's possible with .NET (to me, it's a huge security risk), and how you'd program for unknowns (e.g. pop-ups and Flash), I have no idea. But if you're desperate, an adventurous .NET developer may be able to suggest more.
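
To make the queueing advice in option 2 a bit more concrete, here is a rough sketch in Python (not PHP) of handling one page at a time, with a per-page timeout and a periodic browser restart. Everything named here is hypothetical: `Browser`, `render_queue`, `new_browser` and `restart_every` are made up for illustration, and a real `Browser` would have to wrap Selenium or whichever tool you settle on.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Optional, Protocol


class Browser(Protocol):
    """Hypothetical interface; a real one would wrap Selenium or similar."""

    def render(self, url: str, timeout: float) -> str: ...

    def close(self) -> None: ...


def render_queue(
    urls: Iterable[str],
    new_browser: Callable[[], Browser],
    timeout: float = 30.0,
    restart_every: int = 50,
) -> Dict[str, Optional[str]]:
    """Render URLs strictly one at a time with a single browser instance.

    The browser is replaced after `restart_every` successful pages (to
    release memory) and after any failure (timeout, crash, ...).
    A failed page is recorded as None.
    """
    pending = deque(urls)
    results: Dict[str, Optional[str]] = {}
    browser = new_browser()
    pages_since_restart = 0
    try:
        while pending:
            url = pending.popleft()
            if pages_since_restart >= restart_every:
                # Periodic restart to clean out the browser's memory.
                browser.close()
                browser = new_browser()
                pages_since_restart = 0
            try:
                results[url] = browser.render(url, timeout)
                pages_since_restart += 1
            except Exception:
                # Assumption: any error leaves the browser in an unknown
                # state, so throw it away and start a fresh one.
                results[url] = None
                browser.close()
                browser = new_browser()
                pages_since_restart = 0
    finally:
        browser.close()
    return results
```

This only covers the queue, timeout and restart parts. Handling alert()s, pop-up windows, redirects and plugins would have to live inside whatever implements `render`.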

In summary - if you want more than a screen grab and so can't choose option 1, good luck ;)

Robbie answered Oct 30 '22 00:10



If you're looking for something scriptable with no GUI you could use a headless browser. I've used PhantomJS for similar tasks.
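
As an illustration of the headless-browser approach, here is a small Python sketch that shells out to headless Chrome/Chromium rather than PhantomJS (whose development was suspended in 2018) and captures the DOM after scripts have run, using the browser's `--headless` and `--dump-dom` flags. The binary name `chromium` is an assumption (on your system it may be `google-chrome`, `chromium-browser`, or a full path), and the helper names `dump_dom_command` and `fetch_rendered_html` are made up for this example.

```python
import subprocess
from typing import List


def dump_dom_command(url: str, browser: str = "chromium") -> List[str]:
    """Build the command line that prints the rendered DOM to stdout.

    `browser` is an assumed binary name; adjust it for your install.
    """
    return [browser, "--headless", "--disable-gpu", "--dump-dom", url]


def fetch_rendered_html(
    url: str, browser: str = "chromium", timeout: float = 30.0
) -> str:
    """Run the headless browser once and return the serialized DOM.

    Raises subprocess.TimeoutExpired if the page takes longer than
    `timeout` seconds, and CalledProcessError on a non-zero exit.
    """
    result = subprocess.run(
        dump_dom_command(url, browser),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

Calling `fetch_rendered_html("https://example.com")` should return the page's HTML as the browser sees it after initial script execution, provided the binary is installed. Content that loads later (after timers or user interaction) may still be missing, in which case a scripted driver is probably the better fit.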

Tom Pietrosanti answered Oct 29 '22 22:10
