Hi, I have a problem with Jsoup.
I scrape a page and get a lot of URLs. Some of them are relative URLs like "../index.php", "../admin", and "../details.php".
I use attr("abs:href") to get the absolute URL, but these links are rendered like www.domain.com/../admin.php.
I would like to know if this is a bug.
Is there a way to get the real absolute path with jsoup? How can I solve this?
I have also tried absUrl("href"), but it does not work either.
clean. Creates a new, clean document, from the original dirty document, containing only elements allowed by the safelist. The original document is not modified. Only elements from the dirty document's body are used.
link.attr("abs:href") − provides the absolute URL after resolving against the document's base URI.
link.absUrl("href") − provides the absolute URL after resolving against the document's base URI.
What It Is. jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
Another good option is to use the abs:href or abs:src attributes:
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
This is also described here: http://jsoup.org/cookbook/extracting-data/working-with-urls
If element contains a relative link, you can get the absolute link like this: element.absUrl("href").
But you have to set the base URI for your relative links first (e.g. call setBaseUri("http://www.myexample.com") on your Document or Element).
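To illustrate what the base URI contributes, here is a minimal sketch that resolves relative links with the JDK's java.net.URI instead of jsoup, so it runs without any dependency. The class name BaseUriDemo, the helper resolve, and the base URI with a /pages/ directory are assumptions made up for this example; in jsoup you would call absUrl("href") after setting the base URI.

```java
import java.net.URI;

public class BaseUriDemo {
    // Resolve a relative href against a base URI (hypothetical helper,
    // standing in for what absUrl("href") does with the element's base URI).
    public static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed base URI; note the trailing slash marking a directory.
        String base = "http://www.myexample.com/pages/";

        // A plain relative link stays inside the base directory.
        System.out.println(resolve(base, "index.php"));
        // -> http://www.myexample.com/pages/index.php

        // "../" climbs one level up from the base directory.
        System.out.println(resolve(base, "../details.php"));
        // -> http://www.myexample.com/details.php
    }
}
```

Without a base URI there is nothing to resolve against, which is why the relative link cannot be turned into an absolute one.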
Make sure your base URI is long enough!
Good:
element.setBaseUri("http://www.example.com/abc/");
element.attr("href", "../b/here");
element.absUrl("href") returns: http://www.example.com/b/here
Bad:
element.setBaseUri("http://www.example.com/abc/");
element.attr("href", "../../b/here");
element.absUrl("href") returns: http://www.example.com/../b/here
--> your relative link is too long for your base URI: it has more "../" steps than the base path has directories!
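If you cannot fix the base URI, one possible workaround is to post-process the resolved URL and drop any "../" segments left over at the start of the path, which is how browsers treat them. The sketch below uses only java.net.URI from the JDK. The class DotSegmentCleaner and the method resolveAndClean are hypothetical names for this example and are not part of jsoup.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DotSegmentCleaner {
    // Resolve href against baseUri, then strip "../" and "./" segments
    // that remain at the start of the path (hypothetical helper).
    public static String resolveAndClean(String baseUri, String href) {
        URI resolved = URI.create(baseUri).resolve(href);
        String path = resolved.getPath();
        if (path == null) {
            // Opaque URIs such as mailto: have no path to clean.
            return resolved.toString();
        }
        // Each pass removes one leading dot segment, e.g. "/../b" -> "/b".
        while (path.startsWith("/../") || path.startsWith("/./")) {
            path = path.substring(path.indexOf('/', 1));
        }
        try {
            return new URI(resolved.getScheme(), resolved.getAuthority(),
                    path, resolved.getQuery(), resolved.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The "Bad" case from above: two "../" steps, one directory in the base.
        System.out.println(resolveAndClean("http://www.example.com/abc/", "../../b/here"));
        // -> http://www.example.com/b/here
    }
}
```

With jsoup you could pass the result of element.absUrl("href") through the same path-cleaning loop. Whether that is appropriate depends on the site, since a link that climbs above the root may simply be a broken link.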