Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Jsoup fetching a partial page

I am trying to scrape the contents of bidding websites, but am unable to fetch the complete page of the website . I am using crowbar on xulrunner to fetch the page first (as ajax loads certain elements in lazy fashion) and then scrape from the file. But on the mainpage of bidrivals website, this fails even when the local file is well formed. jSoup simply seems to end with '...' characters midway in the html code. If anyone has encountered this before, please help. The following Code is called for [this link].

File f = new File(projectLocation+logFile+"bidrivalsHome");
    try {
        f.createNewFile();
        log.warn("Trying to fetch mainpage through a console.");
        WinRedirect.redirect(projectLocation+"Curl.exe -s --data \"url="+website+"&delay="+timeDelay+"\" http://127.0.0.1:10000", projectLocation, logFile+"bidrivalsHome");
    } catch (Exception e) {
        e.printStackTrace();
        log.warn("Error in fetching the nameList", e);
    }
    Document doc = new Document("");
    try {
        doc = Jsoup.parse(f, "UTF-8", website);
    } catch (IOException e1) {
        System.out.println("Error while parsing the document.");
        e1.printStackTrace();
        log.warn("Error in parsing homepage", e1);
    }
like image 899
sumit Avatar asked Jun 16 '11 06:06

sumit


People also ask

What does jsoup parse do?

jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.

Does jsoup run JavaScript?

jsoup will not run JavaScript for you - if you need that in your app I'd recommend looking at JCEF.

Is jsoup an API?

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.


1 Answers

Try using HtmlUnit to render the page (including JavaScript and CSS dom manipulation) and then pass the rendered HTML to jsoup.

// load page using HTML Unit and fire scripts
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);

// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml(), baseURI);

// clean up resources        
webClient.close();


page.html - source code

<html>
<head>
    <script src="loadData.js"></script>
</head>
<body onLoad="loadData()">
    <div class="container">
        <table id="data" border="1">
            <tr>
                <th>col1</th>
                <th>col2</th>
            </tr>
        </table>
    </div>
</body>
</html>

loadData.js

    // append rows and cols to table.data in page.html
    function loadData() {
        data = document.getElementById("data");
        for (var row = 0; row < 2; row++) {
            var tr = document.createElement("tr");
            for (var col = 0; col < 2; col++) {
                td = document.createElement("td");
                td.appendChild(document.createTextNode(row + "." + col));
                tr.appendChild(td);
            }
            data.appendChild(tr);
        }
    }

page.html when loaded to browser

| Col1   | Col2   |
| ------ | ------ |
| 0.0    | 0.1    |
| 1.0    | 1.1    |

Using jsoup to parse page.html for col data

    // load source from file
    Document doc = Jsoup.parse(new File("page.html"), "UTF-8");

    // iterate over row and col
    for (Element row : doc.select("table#data > tbody > tr"))

        for (Element col : row.select("td"))

            // print results
            System.out.println(col.ownText());

Output

(empty)

What happened?

Jsoup parses the source code as delivered from the server (or in this case loaded from file). It does not invoke client-side actions such as JavaScript or CSS DOM manipulation. In this example, the rows and cols are never appended to the data table.

How to parse my page as rendered in the browser?

    // load page using HTML Unit and fire scripts
    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage(new File("page.html").toURI().toURL());

    // convert page to generated HTML and convert to document
    doc = Jsoup.parse(myPage.asXml());

    // iterate row and col
    for (Element row : doc.select("table#data > tbody > tr"))

        for (Element col : row.select("td"))

            // print results
            System.out.println(col.ownText());

    // clean up resources        
    webClient.close();

Output

0.0
0.1
1.0
1.1
like image 101
Zack Avatar answered Nov 15 '22 22:11

Zack