Jsoup get all links from a page

I'm implementing a web robot that has to get all the links from a page and select the ones I need. I have it all working, except I ran into a problem when a link is nested inside a "table" or a "span" tag. Here's my code snippet:

Document doc = Jsoup.connect(url)
    .timeout(TIMEOUT * 1000)
    .get();
Elements elts = doc.getElementsByTag("a");

And here's the example HTML:

<table>
  <tr><td><a href="www.example.com"></a></td></tr>
</table>

My code does not fetch such links. Using doc.select doesn't help either. My question is: how do I get all the links from the page?

EDIT: I think I know where the problem is. The page I'm having trouble with is very badly written; an HTML validator reports a tremendous number of errors. Could this be causing the problem?

asked Sep 21 '12 by Marcin Krzysiak



1 Answer

In general, Jsoup can handle most bad HTML. Dump the HTML as Jsoup uses it (you can simply output doc.toString()).
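For example, a minimal sketch, assuming doc is the Document from the question:

// Print the DOM exactly as Jsoup parsed it, to check whether
// the link inside the table survived the parse
System.out.println(doc.toString());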

Tip: use select() instead of the getElementsByX() methods; it's faster and more flexible.

Elements elts = doc.select("a");

Here's an overview of the Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
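For completeness, here's a minimal, self-contained sketch of the approach (the URL and timeout are placeholders, not taken from the question). The a[href] selector matches anchors anywhere in the document, including ones nested in tables or spans, and abs:href resolves relative URLs against the page's base URI:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and timeout -- adjust to your own robot's settings
        Document doc = Jsoup.connect("http://example.com/")
                .timeout(10 * 1000)
                .get();

        // a[href] matches every anchor with an href attribute,
        // no matter how deeply it is nested (tables, spans, ...)
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // abs:href turns relative links into absolute URLs
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}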

answered Nov 04 '22 by ollo