I'm implementing a web robot that has to get all the links from a page and select the needed ones. I got it all working, except that I ran into a problem when a link is inside a "table" or a "span" tag. Here's my code snippet:
Document doc = Jsoup.connect(url)
.timeout(TIMEOUT * 1000)
.get();
Elements elts = doc.getElementsByTag("a");
And here's the example HTML:
<table>
<tr><td><a href="www.example.com"></a></td></tr>
</table>
My code does not fetch such links. Using doc.select doesn't help either. My question is: how do I get all the links from the page?
EDIT: I think I know where the problem is. The page I'm having trouble with is very badly written; an HTML validator reports a tremendous number of errors. Could this cause problems?
In general, Jsoup can handle most bad HTML. Dump the HTML as Jsoup sees it (you can simply output doc.toString()) and check whether the links are present in the parsed document.
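To illustrate that debugging step, here is a minimal sketch that parses a deliberately broken fragment from a string, prints the tree as jsoup repaired it, and counts the anchors. The class name DumpParsedHtml, the method linksIn, and the sample markup are made up for the example; the real page would still be fetched with Jsoup.connect(url).get() as in the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class DumpParsedHtml {
    // Hypothetical malformed input: <td>, <tr> and <table> are never closed.
    static final String BROKEN_HTML =
        "<table><tr><td><a href=\"http://www.example.com/\">link</a>";

    // Parses the string, prints the repaired document, and returns its anchors.
    static Elements linksIn(String html) {
        Document doc = Jsoup.parse(html);
        // This is the document jsoup actually queries, not the raw source.
        System.out.println(doc.html());
        return doc.select("a");
    }

    public static void main(String[] args) {
        Elements links = linksIn(BROKEN_HTML);
        System.out.println(links.size() + " link(s) found");
    }
}
```

If the link shows up in the dumped output but not in your results, the selector is the likely culprit; if it is missing from the dump, the server probably returned different HTML than the browser shows.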
Tip: use select() instead of getElementsByX(); it's faster and more flexible.
Elements elts = doc.select("a");
(edit)
Here's an overview about the Selector-API: http://jsoup.org/cookbook/extracting-data/selector-syntax
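As a rough sketch of how the selector syntax can be applied to the original problem, the following collects every link that actually carries an href and resolves it to an absolute URL. The class name LinkCollector, the method absoluteLinks, the sample markup, and the base URI are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkCollector {
    // Returns the absolute URL of every <a> that has an href attribute.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        // "a[href]" skips named anchors such as <a name="top">.
        for (Element a : doc.select("a[href]")) {
            String url = a.absUrl("href"); // empty string if it cannot be resolved
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical page with links nested in a table and a span.
        String html = "<table><tr><td><a href=\"/page\">in table</a></td></tr></table>"
                    + "<span><a href=\"http://other.example/\">in span</a></span>"
                    + "<a name=\"top\">no href</a>";
        for (String url : absoluteLinks(html, "http://www.example.com/")) {
            System.out.println(url);
        }
    }
}
```

Note that an href such as www.example.com, with no scheme, is resolved as a relative path against the base URI, so it will not point at the host you might expect.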