Jsoup get all links from a page

I'm implementing a web robot that has to get all the links from a page and select the ones I need. I have it all working, except I ran into a problem when a link is nested inside a "table" or a "span" tag. Here's my code snippet:

Document doc = Jsoup.connect(url)
    .timeout(TIMEOUT * 1000)
    .get();
Elements elts = doc.getElementsByTag("a");

And here's the example HTML:

<table>
  <tr><td><a href="www.example.com"></a></td></tr>
</table>

My code does not fetch such links. Using doc.select doesn't help either. My question is: how do I get all the links from the page?

EDIT: I think I know where the problem is. The page I'm having trouble with is very badly written; an HTML validator reports a tremendous number of errors. Could this be causing the problem?

asked Sep 21 '12 by Marcin Krzysiak



1 Answer

In general, Jsoup can handle most bad HTML. Dump the HTML as Jsoup uses it (you can simply output doc.toString()).
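For example, a minimal sketch, assuming doc is the Document from the question:

// Print the DOM exactly as Jsoup parsed it, to check whether
// the link inside the table survived the parse
System.out.println(doc.toString());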

Tip: use select() instead of the getElementsByX() methods; it's faster and more flexible.

Elements elts = doc.select("a");

Here's an overview of the Selector API: http://jsoup.org/cookbook/extracting-data/selector-syntax
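For completeness, here's a minimal, self-contained sketch of the approach (the URL and timeout are placeholders, not taken from the question). The a[href] selector matches anchors anywhere in the document, including ones nested in tables or spans, and abs:href resolves relative URLs against the page's base URI:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and timeout -- adjust to your own robot's settings
        Document doc = Jsoup.connect("http://example.com/")
                .timeout(10 * 1000)
                .get();

        // a[href] matches every anchor with an href attribute,
        // no matter how deeply it is nested (tables, spans, ...)
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // abs:href turns relative links into absolute URLs
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}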

answered Nov 04 '22 by ollo