Get innerHTML via Jsoup

I'm trying to scrape data from this website: http://www.bundesliga.de/de/liga/tabelle/

In the source code I can see the tables, but the cells have no content, just things like:

<td>[no content]</td>
<td>[no content]</td>
<td>[no content]</td>
<td>[no content]</td>
....

With Firebug (F12 in Firefox) I don't see any content either, but I can select the table and copy its innerHTML via the Firebug option. That way I get all the information about the teams, but I don't know how to get the table with its content in Jsoup.

unrated asked Feb 22 '14 15:02


2 Answers

To get the value of an attribute, use the Node.attr(String key) method. For the text of an element (and its combined children), use Element.text(). For HTML, use Element.html() or Node.outerHtml() as appropriate. For example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

reference: http://jsoup.org/cookbook/extracting-data/attributes-text-html

Adel answered Oct 15 '22 14:10


The table is not rendered on the server. It is built by the page's client-side JavaScript from data that reaches the client via AJAX. Jsoup only parses the HTML it downloads and does not execute JavaScript, so the empty cells you get with the naive Jsoup approach are expected.

I see two possible solutions:

  1. Analyze the network traffic and identify the AJAX calls the site makes. Then work out the request format and fire the same requests the JavaScript would. From the responses you can reconstruct the table.
  2. Don't use Jsoup alone but a real browser that loads the page and runs the JavaScript, including all AJAX calls. You could use Selenium WebDriver for that. There is a headless browser called PhantomJS with a relatively small footprint that you can use in combination with Selenium WebDriver.
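For the first option, a minimal sketch of replaying an AJAX call from Java might look like the following. It uses the JDK's own java.net.http client (Java 11+) so that it has no dependencies; Jsoup.connect(url).ignoreContentType(true).execute().body() should work just as well. The class name AjaxReplay, the header values and the endpoint URL are assumptions for illustration; the real endpoint and headers have to be read from the browser's network tab.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: re-sends a request that the page's JavaScript makes.
class AjaxReplay {

    // Build a GET request that resembles a browser AJAX call.
    // The header values are assumptions; copy the real ones from the
    // network tab of the browser's developer tools.
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Accept", "application/json")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Send the request and return the raw response body (often JSON).
    static String fetch(String endpoint) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: AjaxReplay <endpoint-url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The body that comes back is whatever the site's JavaScript would have received, so the remaining work is parsing that format into table rows.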

Both options have their (dis)advantages:

  1. This takes more time, since you need to understand the network traffic pretty well. The reward is a very fast and memory-efficient scraper.
  2. Programming Selenium is very easy, and you should not have any difficulty achieving your goal. You don't need to understand the inner workings of the site you want to scrape. However, the price is a further dependency in your project, high memory consumption, an extra process to run, and slow scraping.
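For the browser-based option, the idea can be sketched without any extra dependency by starting headless Chrome as a child process and reading the DOM it prints. This is a stand-in for Selenium, not Selenium itself: it relies on Chrome's --headless and --dump-dom switches, and the class name RenderedPage and the binary name passed in are assumptions. Data that arrives after the page has loaded may be missing from the dump; Selenium lets you wait for a specific element instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: asks a locally installed headless Chrome to render a page.
class RenderedPage {

    // Command line that makes Chrome load the URL, run its JavaScript
    // and print the resulting DOM to standard output.
    static List<String> buildCommand(String chromeBinary, String url) {
        return List.of(chromeBinary, "--headless", "--dump-dom", url);
    }

    // Run the browser and return the rendered HTML as a string.
    static String render(String chromeBinary, String url)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(chromeBinary, url))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop browser log output
                .start();
        String html = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Browser exited with code " + exitCode);
        }
        return html;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: RenderedPage <chrome-binary> <url>");
            return;
        }
        System.out.println(render(args[0], args[1]));
    }
}
```

The returned string can then be handed to Jsoup.parse(html) and queried with select(...) as usual, since the table cells are filled in by the time the DOM is dumped.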

Maybe you can find another source for the league table that holds the information you want? That might be the easiest route. For example: http://www.fussballdaten.de/bundesliga/

luksch answered Oct 15 '22 14:10