Page scraping seems to have hit somewhat of a wall for me, as more and more sites depend on JavaScript to render portions of the page.
It seems to me that with so many open-source layout and JavaScript engines available (WebKit, Gecko, and Chromium + V8), someone must have built a tool for downloading a page and rendering its JavaScript without running an actual browser. However, my searches aren't turning up what I'm looking for - I've found tools like Selenium RC, but they depend on a running browser. I'm interested in any tool or library that can do one (or both) of the following:
1. A program that can be run from the command line (*nix) which, given the source of a page, returns the page's source as rendered by some JS engine.
2. Integrated support in a particular language that allows one to (easily) pass the source of a page to it and get back the page's source as rendered by some JS engine.
I think #1 is preferable in a general sense, but #2 would be more useful if the tool exists in the language I want to work in. Also, I'm not concerned with the particular JS engine - any relatively modern one will do. What is out there?
When the browser parses HTML, every element it encounters (html, body, div, etc.) becomes a node in a tree, and those nodes are exposed to scripts as JavaScript objects. Eventually, the entire document is represented this way.
JavaScript manipulates the page through that tree, the Document Object Model (DOM). Rendering refers to the browser displaying the resulting output. The DOM records parent-child relationships, and adjacent sibling relationships, among the various elements in the HTML file.
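To make that tree concrete, here's a minimal sketch using Java's built-in W3C DOM parser (my own illustration, not part of the answer above); a browser builds the same kind of tree from HTML with its own DOM implementation:

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public class DomDemo {
        public static void main(String[] args) throws Exception {
            // Well-formed XHTML, since the JDK parser expects XML rather
            // than the looser HTML a real browser would accept.
            String html = "<html><head><title>t</title></head>"
                        + "<body><div>hello</div></body></html>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));

            Node div  = doc.getElementsByTagName("div").item(0);
            Node body = div.getParentNode();
            // Parent-child link: the div's parent is the body element.
            System.out.println(body.getNodeName());                      // body
            // Sibling link: head and body are adjacent children of html.
            System.out.println(body.getPreviousSibling().getNodeName()); // head
        }
    }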
wkhtmltopdf (WebKit HTML to PDF) works perfectly; it can even produce JPGs.
http://wkhtmltopdf.googlecode.com
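Basic usage looks something like this (a sketch from memory; exact flags may vary between versions):

    # Render a page (JavaScript included) with WebKit and save it as a PDF.
    wkhtmltopdf http://example.com/ page.pdf

    # The companion wkhtmltoimage tool produces images such as JPG.
    wkhtmltoimage --format jpg http://example.com/ page.jpg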
You can look at HtmlUnit. Its main purpose is automated web testing, but I think it may let you get the rendered page.
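For example, something along these lines should print the post-JavaScript source of a page (a sketch assuming a recent HtmlUnit release; older versions manage the WebClient lifecycle differently):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class RenderedSource {
        public static void main(String[] args) throws Exception {
            // WebClient is HtmlUnit's headless "browser"; JavaScript
            // support (via its Rhino-based engine) is enabled by default.
            try (WebClient webClient = new WebClient()) {
                HtmlPage page = webClient.getPage("http://example.com/");
                // asXml() serializes the DOM after scripts have run,
                // which is the "source as rendered" the question asks for.
                System.out.println(page.asXml());
            }
        }
    }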