Get innerHTML via Jsoup

Question

I'm trying to scrape data from this website: http://www.bundesliga.de/de/liga/tabelle/

In the source code I can see the tables, but the cells have no content, just things like:

<td>[no content]</td>
<td>[no content]</td>
<td>[no content]</td>
<td>[no content]</td>
....

With Firebug (F12 in Firefox) I don't see any content either, but I can select the table and copy its innerHTML via the Firebug option. In that case I get all the information about the teams, but I don't know how to get the table with its content in Jsoup.

Adel · Accepted Answer

To get the value of an attribute, use the Node.attr(String key) method. For the text of an element (and its combined children), use Element.text(). For HTML, use Element.html() or Node.outerHtml() as appropriate. For example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

reference: http://jsoup.org/cookbook/extracting-data/attributes-text-html

luksch · Answer

The table is not rendered directly on the server. It is built by the page's client-side JavaScript from data that reaches the client via AJAX. Jsoup does not execute JavaScript, so the empty cells you get with the naive Jsoup approach are expected.
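To illustrate the point, here is a minimal sketch. The markup in SERVED_HTML is a made-up stand-in for what the server sends (the table id and the inline script are assumptions, not the real page). Jsoup parses the script element as plain data and never runs it, so the cells stay empty:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StaticParseDemo {
    // Hypothetical stand-in for the served page: an empty table plus a
    // script that would fill the first cell in a real browser.
    static final String SERVED_HTML =
            "<table id='standings'><tr><td></td><td></td></tr></table>"
          + "<script>document.querySelector('td').textContent = 'Team A';</script>";

    // Counts the table cells that contain text after a plain Jsoup parse.
    static int countNonEmptyCells(String html) {
        Document doc = Jsoup.parse(html);
        int filled = 0;
        for (Element cell : doc.select("#standings td")) {
            if (cell.hasText()) {
                filled++;
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        // Prints 0: the script was parsed but not executed.
        System.out.println(countNonEmptyCells(SERVED_HTML));
    }
}
```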

I see two possible solutions:

1. Analyze the network traffic and identify the AJAX calls the site makes. Then reconstruct the request format, fire the same requests the JavaScript would, and rebuild the table from the responses.
2. Don't use Jsoup alone, but a real browser that loads the page and runs the JavaScript, including all AJAX calls. You could use Selenium WebDriver for that. There is a headless browser called PhantomJS with a relatively small footprint that you can use in combination with Selenium WebDriver.
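The first option could look roughly like the sketch below. It uses the JDK's own java.net.http client (Java 11+). DATA_URL is a placeholder, not the site's real endpoint, and the two headers are only an assumption about what such an endpoint might expect. Copy the actual URL, headers, and parameters from your browser's network tab.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class AjaxReplay {
    // Hypothetical endpoint: replace it with the URL seen in the network tab.
    static final String DATA_URL = "http://example.com/hypothetical/standings.json";

    // Builds a GET request that imitates a browser-side AJAX call.
    // Both headers are assumptions; mirror whatever the real call sends.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body (e.g. JSON).
    static String fetchBody(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : DATA_URL;
        System.out.println(fetchBody(url));
    }
}
```

The response format depends entirely on the site. If it is JSON, parse it with a JSON library. If it is an HTML fragment, you can hand it to Jsoup.parse().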

Both options have their advantages and disadvantages:

1. Replaying the AJAX requests takes more time up front, since you need to understand the network traffic pretty well. The reward is a very fast and memory-efficient scraper.
2. Programming Selenium is very easy, and you should not have any difficulty achieving your goal. You don't need to understand the inner workings of the site you want to scrape. The price is a further dependency in your project, high memory consumption, an additional running process, and slow scraping.
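For the second option, a rough sketch with Selenium 4 and Jsoup might look like this. Headless Chrome is swapped in here for the PhantomJS mentioned above, and the "table td" selector and the 15-second timeout are assumptions about the page, so adjust them after inspecting the rendered DOM.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedTableScraper {
    // Loads the page in headless Chrome, waits until at least one table
    // cell contains text, and returns the rendered HTML.
    static String renderPage(String url) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);
            new WebDriverWait(driver, Duration.ofSeconds(15))
                    .until(d -> d.findElements(By.cssSelector("table td")).stream()
                            .anyMatch(td -> !td.getText().isBlank()));
            return driver.getPageSource();
        } finally {
            driver.quit(); // always shut the browser process down
        }
    }

    // Turns rendered HTML into rows of cell texts. Rows without any
    // non-empty <td> (such as header rows) are skipped.
    static List<List<String>> extractRows(String renderedHtml) {
        Document doc = Jsoup.parse(renderedHtml);
        List<List<String>> rows = new ArrayList<>();
        for (Element tr : doc.select("table tr")) {
            List<String> cells = tr.select("td").eachText();
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = renderPage("http://www.bundesliga.de/de/liga/tabelle/");
        extractRows(html).forEach(System.out::println);
    }
}
```

Keeping the parsing in extractRows() separate from the browser code means you can test it against saved HTML without starting a browser.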

Maybe you can find another source for the league table that has the information you want? That might be the easiest route. For example: http://www.fussballdaten.de/bundesliga/

Tags: html, web-scraping, jsoup
