
Render JavaScript and HTML in (any) Java Program (Access rendered DOM Tree)?

What are the best Java libraries to fully download any webpage, execute its embedded JavaScript, and then access the rendered page (that is, the DOM tree) programmatically and get that DOM tree as "HTML source"?

(Something similar to what Firebug does: it renders the page and I get access to the fully rendered DOM tree, as the page looks in the browser. In contrast, if I click "show source" I only get the JavaScript source code. That is not what I want; I need access to the rendered page.)

(By rendering I mean only building the DOM tree, not visual rendering.)

This does not have to be a single library; it's OK to have several libraries that accomplish this together (one downloads, one renders, and so on). However, due to the dynamic nature of JavaScript, the JavaScript library will most likely also need some kind of downloader of its own to fully render any asynchronous JS.

Background:
In the "good old days", HttpClient (the Apache library) was all you needed to build your own very simple crawler. (A lot of crawlers, such as Nutch or Heritrix, are still built around this core principle, mainly focusing on standard HTML parsing, so I can't learn from them.) My problem is that I need to crawl some websites that rely heavily on JavaScript and that I can't parse with HttpClient, because I definitely need to execute the JavaScript first.

tim asked Jan 29 '10 09:01



2 Answers

This is a bit outside the box, but if you are planning on running your code on a server where you have complete control over your environment, it might work...

Install Firefox (or XulRunner, if you want to keep things lightweight) on your machine.

Using the Firefox plugin system, write a small plugin which loads a given URL, waits a few seconds, then copies the page's DOM into a String.

From this plugin, use the Java LiveConnect API (see http://jdk6.java.net/plugin2/liveconnect/ and https://developer.mozilla.org/en/LiveConnect ) to push that string across to a public static function in some embedded Java code, which can either do the required processing itself or farm it out to some more complicated code.
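The fixed "waits a few seconds" step above could instead be approximated by polling until the DOM stops changing. The sketch below is plain Java and rests on assumptions: `DomSettleWaiter` and `awaitStable` are hypothetical names, and the `Supplier<String>` stands in for whatever returns the current DOM as a string (the plugin, or any other renderer). It is only a heuristic; a page that updates on a timer may never settle, which is why there is a timeout.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper: waits until a DOM snapshot stops changing. */
final class DomSettleWaiter {

    private DomSettleWaiter() {}

    /**
     * Polls the snapshot supplier until it has returned a value equal to the
     * previous poll for requiredStablePolls polls in a row, or until the
     * timeout elapses. Returns the settled snapshot, or empty on timeout
     * or if the waiting thread is interrupted.
     */
    static Optional<String> awaitStable(Supplier<String> snapshot,
                                        int requiredStablePolls,
                                        Duration pollInterval,
                                        Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        String previous = null;
        int stablePolls = 0;
        while (true) {
            String current = snapshot.get();
            if (current != null && current.equals(previous)) {
                stablePolls++;
                if (stablePolls >= requiredStablePolls) {
                    return Optional.of(current);
                }
            } else {
                stablePolls = 0; // the DOM changed, start counting again
            }
            previous = current;
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return Optional.empty();
            }
        }
    }
}
```

A caller might use something like `awaitStable(source, 3, Duration.ofMillis(250), Duration.ofSeconds(15))`; those numbers are illustrative, not recommendations.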

Benefits: You are using a browser that most application developers target, so the observed behavior should be comparable. You can also upgrade the browser along the normal upgrade path, so your library won't become out-of-date as HTML standards change.

Disadvantages: You will need to have permission to start a non-headless application on your server. You'll also have the complexity of inter-process communication to worry about.

I have used the plugin API to call Java before, and it's quite achievable. If you'd like some sample code, you should take a look at the XQuery plugin - it loads XQuery code from the DOM, passes it across to the Java Saxon library for processing, then pushes the result back into the browser. There are some details about it here:

https://developer.mozilla.org/en/XQuery
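For the Java side of the hand-off described above (the "public static function" that receives the DOM string), a minimal sketch might look like the following. Everything here is an assumption for illustration: `DomReceiver`, `Snapshot`, `receive` and `poll` are hypothetical names, and the sketch does not show the LiveConnect wiring itself, only a plain-Java entry point that queues the string so heavier processing can happen on another thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical receiving end for DOM strings pushed from the browser side. */
final class DomReceiver {

    /** One captured page: where it came from and its serialized DOM. */
    static final class Snapshot {
        final String url;
        final String html;

        Snapshot(String url, String html) {
            this.url = url;
            this.html = html;
        }
    }

    private static final BlockingQueue<Snapshot> QUEUE = new LinkedBlockingQueue<>();

    private DomReceiver() {}

    /** Entry point the browser-side code would call; returns quickly. */
    public static void receive(String url, String html) {
        if (url == null || html == null) {
            throw new IllegalArgumentException("url and html must not be null");
        }
        QUEUE.add(new Snapshot(url, html));
    }

    /** Returns the next snapshot, or null if none arrives within the timeout. */
    static Snapshot poll(long timeoutMillis) {
        try {
            return QUEUE.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return null;
        }
    }
}
```

Keeping `receive` cheap and doing the real work in a consumer that calls `poll` is one way to avoid blocking the browser while a page is processed.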

Erica answered Sep 17 '22 09:09


You can use the JavaFX 2 WebEngine. Download the JavaFX SDK (you may already have it if you installed JDK7u2 or later) and try the code below.

It will print the HTML after the page's JavaScript has been processed. You can uncomment the two commented-out stage lines to see the visual rendering as well.

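import javafx.application.Application;
import javafx.beans.value.ChangeListener;
import javafx.beans.value.ObservableValue;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

public class WebLauncher extends Application {

    @Override
    public void start(Stage stage) {
        final WebView webView = new WebView();
        final WebEngine webEngine = webView.getEngine();
        webEngine.load("http://stackoverflow.com");
        // Uncomment (and import javafx.scene.Scene) to see the page rendered as well.
        //stage.setScene(new Scene(webView));
        //stage.show();

        // Print the DOM once the load worker reports that all work is done.
        webEngine.getLoadWorker().workDoneProperty().addListener(new ChangeListener<Number>() {
            @Override
            public void changed(ObservableValue<? extends Number> observable, Number oldValue, Number newValue) {
                if (newValue.intValue() == 100 /*percents*/) {
                    try {
                        org.w3c.dom.Document doc = webEngine.getDocument();
                        // Standard JAXP serialization instead of the JDK-internal XMLSerializer.
                        Transformer transformer = TransformerFactory.newInstance().newTransformer();
                        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
                        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
                        transformer.transform(new DOMSource(doc), new StreamResult(System.out));
                    } catch (TransformerException ex) {
                        ex.printStackTrace();
                    }
                }
            }
        });

    }

    public static void main(String[] args) {
        launch(args);
    }

}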
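If the goal is to get the rendered DOM back as an "HTML source" string for further processing rather than printing it, the `org.w3c.dom.Document` returned by `webEngine.getDocument()` can be serialized with the standard `javax.xml.transform` API. The following is a minimal sketch under assumptions: `DomToHtml` and `serialize` are hypothetical names, not part of JavaFX, and the `html` output method is chosen here so that empty elements such as `script` are not collapsed into self-closing tags.

```java
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Node;

/** Hypothetical helper: turns a W3C DOM node into an HTML string. */
final class DomToHtml {

    private DomToHtml() {}

    /** Serializes the given node (for example a whole Document) as HTML markup. */
    static String serialize(Node node) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers inside listeners need no checked-exception handling.
            throw new IllegalStateException("Could not serialize DOM", e);
        }
    }
}
```

Inside the listener this would be called as `String html = DomToHtml.serialize(webEngine.getDocument());`. The result is a re-serialization of the live DOM, so whitespace and attribute quoting may differ from the markup the server originally sent.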
Sergey Grinev answered Sep 20 '22 09:09