Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert relative to absolute links using jsoup

I'm using jsoup to clean a html page, the problem is that when I save the html locally, the images do not show because they are all relative links.

Here's some example code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;


public class so2 {

    public static void main(String[] args) {

        String html = "<html><head><title>The Title</title></head>"
                  + "<body><p><a href=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" target=\"_blank\"><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></a></p></body></html>";
        Document doc = Jsoup.parse(html,"https://whatever.com"); // baseUri seems to be ignored??

        System.out.println(doc);        
    }
}

Output:

<html>
 <head>
  <title>The Title</title>
 </head>
 <body>
  <p><a href="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" target="_blank"><img width="437" src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" height="418" class="documentimage"></a></p>
 </body>
</html>

The output still shows the links as a href="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif".

I would like it to convert them to a href="http://whatever.com/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif"

Can anyone show me how to get jsoup to convert all the links to absolute links?

like image 765
localhost Avatar asked Nov 16 '14 11:11

localhost


People also ask

Can we use XPath in jsoup?

With XPath expressions it is able to select the elements within the HTML using Jsoup as HTML parser.

What does jsoup parse do?

jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.

What is abs href?

attr("abs:href") − provides the absolute url after resolving against the document's base URI. link. absUrl("href") − provides the absolute url after resolving against the document's base URI.

What is relative and absolute link?

A URL specifies the location of a target stored on a local or networked computer. The target can be a file, directory, HTML page, image, program, and so on. An absolute URL contains all the information necessary to locate a resource. A relative URL locates a resource using an absolute URL as a starting point.


1 Answers

You can select all the links and transform their hrefs to absolute using Element.absUrl()

Example in your code:

EDIT (added processing of images)

public static void main(String[] args) {

    String html = "<html><head><title>The Title</title></head>"
              + "<body><p><a href=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" target=\"_blank\"><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></a></p></body></html>";
    Document doc = Jsoup.parse(html,"https://whatever.com"); 

    Elements select = doc.select("a");
    for (Element e : select){
        // baseUri will be used by absUrl
        String absUrl = e.absUrl("href");
        e.attr("href", absUrl);
    }

    //now we process the imgs
    select = doc.select("img");
    for (Element e : select){
        e.attr("src", e.absUrl("src"));
    }

    System.out.println(doc);        
}
like image 73
fonkap Avatar answered Sep 30 '22 11:09

fonkap