Jsoup fetching a partial page

Tags: java, web-scraping, jsoup

I am trying to scrape the contents of bidding websites, but I am unable to fetch the complete page. Because certain elements are loaded lazily via AJAX, I first fetch the page with crowbar on XULRunner, save it to a file, and then scrape from that file. On the main page of the bidrivals website this fails even though the local file is well formed: jsoup's output simply ends with '...' characters midway through the HTML. If anyone has encountered this before, please help. The following code is called for [this link].

File f = new File(projectLocation + logFile + "bidrivalsHome");
try {
    f.createNewFile();
    log.warn("Trying to fetch mainpage through a console.");
    WinRedirect.redirect(projectLocation + "Curl.exe -s --data \"url=" + website + "&delay=" + timeDelay + "\" http://127.0.0.1:10000", projectLocation, logFile + "bidrivalsHome");
} catch (Exception e) {
    e.printStackTrace();
    log.warn("Error in fetching the nameList", e);
}
Document doc = new Document("");
try {
    doc = Jsoup.parse(f, "UTF-8", website);
} catch (IOException e1) {
    System.out.println("Error while parsing the document.");
    e1.printStackTrace();
    log.warn("Error in parsing homepage", e1);
}

899

asked Jun 16 '11 06:06

sumit

1 Answer

Try using HtmlUnit to render the page (including JavaScript and CSS DOM manipulation), then pass the rendered HTML to jsoup.

// load the page with HtmlUnit and run its scripts
// (myURL is a placeholder for the address of the page to scrape)
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);

// serialize the rendered DOM and parse it with jsoup
// (baseURI is a placeholder used to resolve relative links)
Document doc = Jsoup.parse(myPage.asXml(), baseURI);

// clean up resources
webClient.close();

page.html - source code

<html>
<head>
    <script src="loadData.js"></script>
</head>
<body onLoad="loadData()">
    <div class="container">
        <table id="data" border="1">
            <tr>
                <th>col1</th>
                <th>col2</th>
            </tr>
        </table>
    </div>
</body>
</html>

loadData.js

    // append rows and cells to the table with id "data" in page.html
    function loadData() {
        var data = document.getElementById("data");
        for (var row = 0; row < 2; row++) {
            var tr = document.createElement("tr");
            for (var col = 0; col < 2; col++) {
                var td = document.createElement("td");
                td.appendChild(document.createTextNode(row + "." + col));
                tr.appendChild(td);
            }
            data.appendChild(tr);
        }
    }

page.html when loaded in a browser

| col1 | col2 |
| ---- | ---- |
| 0.0  | 0.1  |
| 1.0  | 1.1  |

Using jsoup to parse page.html for col data

    // load source from file
    Document doc = Jsoup.parse(new File("page.html"), "UTF-8");

    // iterate over rows and cells
    for (Element row : doc.select("table#data > tbody > tr")) {
        for (Element col : row.select("td")) {
            // print results
            System.out.println(col.ownText());
        }
    }

Output

(empty)

What happened?

Jsoup parses the source code as delivered by the server (or, in this case, loaded from file). It does not perform client-side actions such as JavaScript or CSS DOM manipulation. In this example, the rows and cells are never appended to the data table, so the selector finds no td elements.
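Before bringing in HtmlUnit, it can help to confirm that the elements you are selecting are really missing from the raw source. Here is a minimal sketch using only the JDK. The class name RawSourceCheck and the helper countTag are made up for illustration and are not part of jsoup or HtmlUnit. The substring match is deliberately naive and only meant as a sanity check, not as an HTML parser.

```java
import java.util.Locale;

// Hypothetical helper (not part of jsoup or HtmlUnit): counts opening tags
// in raw HTML text with a plain substring search.
final class RawSourceCheck {

    // The static source of page.html from the example above.
    static final String PAGE_SOURCE =
        "<html>\n"
        + "<head>\n"
        + "    <script src=\"loadData.js\"></script>\n"
        + "</head>\n"
        + "<body onLoad=\"loadData()\">\n"
        + "    <div class=\"container\">\n"
        + "        <table id=\"data\" border=\"1\">\n"
        + "            <tr>\n"
        + "                <th>col1</th>\n"
        + "                <th>col2</th>\n"
        + "            </tr>\n"
        + "        </table>\n"
        + "    </div>\n"
        + "</body>\n"
        + "</html>\n";

    // Counts occurrences of "<tag" that are followed by '>', '/' or whitespace.
    static int countTag(String html, String tag) {
        String lower = html.toLowerCase(Locale.ROOT);
        String needle = "<" + tag.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = lower.indexOf(needle, from)) != -1) {
            int end = from + needle.length();
            // Require a delimiter so that "<th" does not also match "<thead".
            if (end < lower.length()) {
                char c = lower.charAt(end);
                if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                    count++;
                }
            }
            from = end;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("th cells in source: " + countTag(PAGE_SOURCE, "th"));
        System.out.println("td cells in source: " + countTag(PAGE_SOURCE, "td"));
    }
}
```

For the page.html above this reports two th cells and zero td cells, which matches the empty jsoup output: the td elements only exist after loadData() has run in a browser or in HtmlUnit.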

How to parse my page as rendered in the browser?

    // load the page with HtmlUnit and run its scripts
    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage(new File("page.html").toURI().toURL());

    // serialize the rendered DOM and parse it with jsoup
    doc = Jsoup.parse(myPage.asXml());

    // iterate over rows and cells
    for (Element row : doc.select("table#data > tbody > tr")) {
        for (Element col : row.select("td")) {
            // print results
            System.out.println(col.ownText());
        }
    }

    // clean up resources
    webClient.close();

Output

0.0
0.1
1.0
1.1

101

answered Nov 15 '22 22:11

Zack
