
How to get the HTML of a fully loaded page (with JavaScript) as input in Java?

I need to parse a page. Everything works except that some elements on the page are loaded dynamically. I used jsoup for the static elements; when I realized that I really need the dynamic ones, I tried JavaFX. I read a lot of answers on Stack Overflow, and many of them recommended the JavaFX WebEngine, so I ended up with this code:

@Override
public void start(Stage primaryStage) {
    WebView webview = new WebView();
    final WebEngine webengine = webview.getEngine();
    // Read the DOM only after the load worker reports success.
    webengine.getLoadWorker().stateProperty().addListener(
            new ChangeListener<State>() {
                @Override
                public void changed(ObservableValue<? extends State> ov, State oldState, State newState) {
                    if (newState == Worker.State.SUCCEEDED) {
                        Document doc = webengine.getDocument();
                        // Serialize the DOM into a String
                        OutputFormat format = new OutputFormat(doc);
                        StringWriter stringOut = new StringWriter();
                        XMLSerializer serial = new XMLSerializer(stringOut, format);
                        try {
                            serial.serialize(doc);
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                        // Display the XML
                        System.out.println(stringOut.toString());
                    }
                }
            });
    webengine.load("http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658");
    primaryStage.setScene(new Scene(webview, 800, 800));
    primaryStage.show();
}

I serialized the org.w3c.dom.Document to a string and printed it, but that did not help either. primaryStage.show() displayed the fully loaded page (with the element I need rendered on it), but that element was missing from the HTML in the output.

This is the third day I have been working on this issue. Lack of experience is of course my main problem, but I have to admit that I'm stuck. This is my first Java project after reading the Java complete reference. I'm doing it to get some real-world experience (and for fun). I want to write a parser for the Chinese "eBay".

Here is the problem and my test cases:

http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658 : I need to get the dynamically loaded discount price "129.00"

http://item.taobao.com/item.htm?spm=a230r.1.14.67.MNq30d&id=22794120348 : I need "15.20"

As you can see, if you view these pages in a browser, you first see the original price and, after a second or so, the discounted one.

Is it even possible to get these dynamic discounts from the HTML page? The other elements I need to parse are static. What should I try next: another library that renders HTML with JavaScript, or something else? I would really appreciate some advice; I don't want to give up.

rivf asked Aug 03 '13 13:08




1 Answer

The DOM returned after Worker.State.SUCCEEDED should already have been processed by JavaScript.

Your code worked for me when tested with JavaFX 7u40 and 8.0 dev. I see the following output in the log:

<DIV id="J_PromoBox"><EM class="tb-promo-price-type">夏季新品</EM><EM class="tm-yen">¥</EM>    
<STRONG class="J_CurPrice">129.00</STRONG></DIV>

This is the dynamically loaded box with the data (129.00) you were looking for.
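
For reference, the same kind of DOM-to-string output can probably be produced with the standard javax.xml.transform API instead of the XMLSerializer/OutputFormat classes used in the question. The following is only a sketch: the class name DomToString and the helper sampleDocument() are hypothetical, and the small document is built by hand so that the example runs without JavaFX. In the real listener you would pass the result of webengine.getDocument() to serialize().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToString {

    // Serializes any org.w3c.dom.Document to a String using the standard JAXP Transformer.
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Hypothetical stand-in for webengine.getDocument(): a one-element document
    // that mimics the price element from the logged output.
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element strong = doc.createElement("STRONG");
        strong.setAttribute("class", "J_CurPrice");
        strong.setTextContent("129.00");
        doc.appendChild(strong);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(sampleDocument()));
    }
}
```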

You may want to upgrade your JDK to 7u40 or revisit your log parsing algorithm.
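
If the parsing side is the problem, one option is to query the DOM directly instead of searching the printed text. The sketch below uses the standard javax.xml.xpath API. The class name PriceExtractor and the SAMPLE string are hypothetical; SAMPLE is the logged output reduced to the elements that matter, and the upper-case element names are an assumption taken from that output. In the real application the Document would come from webengine.getDocument() rather than from parse().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PriceExtractor {

    // Reduced copy of the logged markup (assumption: element names are upper-case, as in the log).
    static final String SAMPLE =
            "<DIV id=\"J_PromoBox\">"
            + "<STRONG class=\"J_CurPrice\">129.00</STRONG>"
            + "</DIV>";

    // Parses well-formed markup into a DOM; stands in for webengine.getDocument() here.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text of the first STRONG element whose class is J_CurPrice,
    // or an empty string if there is no such element.
    static String extractPrice(Document doc) throws Exception {
        return XPathFactory.newInstance()
                .newXPath()
                .evaluate("//STRONG[@class='J_CurPrice']", doc)
                .trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractPrice(parse(SAMPLE)));
    }
}
```

An empty result from extractPrice() would suggest that the element is not in the DOM at the moment the listener fires, which helps separate a rendering problem from a parsing problem.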

Sergey Grinev answered Oct 14 '22 16:10
