
How to resolve relative url with Jsoup?

Tags: java, url, jsoup

Hi, I have a problem with Jsoup.

I scrape a page and get a lot of URLs. Some of them are relative URLs such as "../index.php", "../admin", "../details.php".

I use attr("abs:href") to get the absolute URL, but these links are rendered like www.domain.com/../admin.php

I would like to know if this is a bug.

Is there a way to get the real absolute path with jsoup? How can I solve this?

I have also tried absUrl("href"), but it does not work either.

Tropicalista asked Aug 20 '12 16:08



2 Answers

Another good option is to use the abs:href or abs:src attributes:

String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"

This is also described in the jsoup cookbook: http://jsoup.org/cookbook/extracting-data/working-with-urls
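
To illustrate what "resolving against the base URI" means, here is a minimal sketch that uses only the JDK's java.net.URI instead of jsoup itself. The class name BaseUriDemo and the helper resolveAgainst are hypothetical, the base URL is simply the cookbook page linked above, and jsoup's own resolution may differ in edge cases.

```java
import java.net.URI;

final class BaseUriDemo {
    // Hypothetical helper: resolves a possibly relative href against the
    // base URI of the page it was found on.
    static String resolveAgainst(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://jsoup.org/cookbook/extracting-data/working-with-urls";
        // A root-relative link and a sibling-page link, resolved against the same base.
        System.out.println(resolveAgainst(base, "/"));
        System.out.println(resolveAgainst(base, "attribute-text-html"));
    }
}
```

Note that URI.create throws an IllegalArgumentException for hrefs that are not valid URIs (for example, ones containing raw spaces), so scraped values may need cleaning first.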

rbs answered Oct 05 '22 10:10


If an element contains a relative link, you can get the absolute link like this: element.absUrl("href").

But you have to set the base URI for your relative links first (e.g. call setBaseUri("http://www.myexample.com") on your Document or Element).

Make sure your base URI is long enough!

Good:

element.setBaseUri("http://www.example.com/abc/");
element.attr("href", "../b/here");

element.absUrl("href") returns: http://www.example.com/b/here

Bad:

element.setBaseUri("http://www.example.com/abc/");
element.attr("href", "../../b/here");

element.absUrl("href") returns: http://www.example.com/../b/here

--> Your relative link is too long for your base URI: it climbs more directory levels than the base path has!
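
If you cannot change the base URI, one possible workaround is to clean up the resolved URL yourself. The following is only a sketch using the JDK's java.net.URI, not a jsoup API: the class RelativeUrls and the method resolveClean are hypothetical names, and dropping the surplus ".." segments (the way browsers usually do) is an assumption about what you want.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class RelativeUrls {
    // Hypothetical helper: resolves href against baseUri, then drops any ".."
    // segments left at the start of the path, since they point above the root.
    static String resolveClean(String baseUri, String href) {
        URI resolved = URI.create(baseUri).resolve(href).normalize();
        if (resolved.isOpaque()) {
            return resolved.toString(); // e.g. mailto: links have no path to clean
        }
        String path = resolved.getPath();
        while (path.startsWith("/../")) {
            path = path.substring(3); // remove "/.." but keep the following "/"
        }
        if (path.equals("/..")) {
            path = "/";
        }
        try {
            return new URI(resolved.getScheme(), resolved.getAuthority(), path,
                    resolved.getQuery(), resolved.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rebuild URL: " + resolved, e);
        }
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/abc/";
        System.out.println(resolveClean(base, "../b/here"));    // the "good" case
        System.out.println(resolveClean(base, "../../b/here")); // the "bad" case
    }
}
```

With this helper both the "good" and the "bad" href from above end up as http://www.example.com/b/here. You would pass it the raw element.attr("href") value together with the page URL, instead of relying on absUrl("href").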

ollo answered Oct 05 '22 09:10