I'm using jsoup to clean an HTML page. The problem is that when I save the HTML locally, the images do not show because they all use relative links.
Here's some example code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class so2 {
    public static void main(String[] args) {
        String html = "<html><head><title>The Title</title></head>"
            + "<body><p><a href=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" target=\"_blank\"><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></a></p></body></html>";
        Document doc = Jsoup.parse(html, "https://whatever.com"); // baseUri seems to be ignored??
        System.out.println(doc);
    }
}
Output:
<html>
<head>
<title>The Title</title>
</head>
<body>
<p><a href="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" target="_blank"><img width="437" src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" height="418" class="documentimage"></a></p>
</body>
</html>
The output still shows the links as a href="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
I would like it to convert them to a href="https://whatever.com/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".
Can anyone show me how to get jsoup to convert all the links to absolute links?
With jsoup as the HTML parser, you can also select elements within the HTML using XPath expressions.
jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
link.attr("abs:href") − provides the absolute URL after resolving against the document's base URI.
link.absUrl("href") − provides the absolute URL after resolving against the document's base URI.
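As a minimal sketch of those two calls side by side, assuming jsoup is on the classpath: the class name AbsAttrDemo, the method bothForms, and the shortened path /data/x.gif are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsAttrDemo {
    // Hypothetical helper: returns the raw href, then the absolute URL
    // obtained via the "abs:" attribute prefix and via absUrl().
    static String[] bothForms(String html, String baseUri) {
        // The second argument is the base URI that relative URLs resolve against.
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return new String[] {
            link.attr("href"),      // attribute exactly as written in the HTML
            link.attr("abs:href"),  // resolved against the base URI
            link.absUrl("href")     // same resolution, different spelling
        };
    }

    public static void main(String[] args) {
        String html = "<a href=\"/data/x.gif\">x</a>";
        for (String s : bothForms(html, "https://whatever.com")) {
            System.out.println(s);
        }
    }
}
```

Note that neither call changes the document: the attribute itself stays relative until you write the absolute value back with attr("href", ...).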
A URL specifies the location of a target stored on a local or networked computer. The target can be a file, directory, HTML page, image, program, and so on. An absolute URL contains all the information necessary to locate a resource. A relative URL locates a resource using an absolute URL as a starting point.
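To see how a relative URL is located from an absolute one, here is a small sketch that uses only the JDK's java.net.URI rather than jsoup. ResolveDemo, resolve, and the example paths are hypothetical names chosen for illustration.

```java
import java.net.URI;

class ResolveDemo {
    // Resolves a reference (relative or absolute) against an absolute base URL.
    static String resolve(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String base = "https://whatever.com/articles/page.html";
        // A root-relative reference replaces the whole path of the base.
        System.out.println(resolve(base, "/data/img.gif"));
        // A plain relative reference replaces only the last path segment.
        System.out.println(resolve(base, "img.gif"));
        // An absolute reference is returned unchanged.
        System.out.println(resolve(base, "https://other.example/x.png"));
    }
}
```

The links in the question start with a slash, so they are root-relative: only the scheme and host of the base are reused.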
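Parsing with a baseUri does not rewrite the attributes; it only records the base that relative URLs are resolved against. You can select all the links and replace their hrefs with absolute URLs using Element.absUrl().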
Example in your code:
EDIT (added processing of images)
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class so2 {
    public static void main(String[] args) {
        String html = "<html><head><title>The Title</title></head>"
            + "<body><p><a href=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" target=\"_blank\"><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></a></p></body></html>";
        Document doc = Jsoup.parse(html, "https://whatever.com");
        // first we process the links
        Elements select = doc.select("a");
        for (Element e : select) {
            // baseUri will be used by absUrl
            String absUrl = e.absUrl("href");
            // overwrite the relative href with the absolute one
            e.attr("href", absUrl);
        }
        // now we process the imgs
        select = doc.select("img");
        for (Element e : select) {
            e.attr("src", e.absUrl("src"));
        }
        System.out.println(doc);
    }
}