
Jsoup.clean() is not preserving relative URLs

I have tried:

Whitelist.relaxed();
Whitelist.relaxed().preserveRelativeLinks(true);
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp");
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp").preserveRelativeLinks(true);

None of them work: when I clean a relative URL such as <a href="/test.xhtml">test</a>, the href attribute is removed and I get <a>test</a>.

I am using JSoup 1.8.2.

Any ideas?

asked Feb 22 '16 20:02 by NotGaeL



1 Answer

The problem most likely stems from how you call the clean method. If you pass a base URI, everything should work as expected:

String html = ""
        + "<a href=\"/test.xhtml\">test</a>"
        + "<invalid>stuff</invalid>"
        + "<h2>header1</h2>";
String cleaned = Jsoup.clean(html, "http://base.uri", Whitelist.relaxed().preserveRelativeLinks(true));
System.out.println(cleaned);

The above keeps the relative link. With the two-argument call, String cleaned = Jsoup.clean(html, Whitelist.relaxed().preserveRelativeLinks(true)), there is no base URI to resolve the link against, so the href attribute is removed.

Note the documentation of Whitelist.preserveRelativeLinks(true):

Note that when handling relative links, the input document must have an appropriate base URI set when parsing, so that the link's protocol can be confirmed. Regardless of the setting of the preserve relative links option, the link must be resolvable against the base URI to an allowed protocol; otherwise the attribute will be removed.
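To illustrate why the base URI matters, here is a minimal sketch that uses only java.net.URI from the standard library. It is not jsoup's actual implementation; the class RelativeLinkCheck, the method hasAllowedProtocol and the ALLOWED list are hypothetical names made up for this example. It only shows the idea from the documentation: a relative href has no protocol of its own, so it can be checked against the allowed protocols only after it has been resolved against a base URI.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

class RelativeLinkCheck {
    // Hypothetical allow-list, loosely mirroring the protocols tried in the question.
    static final List<String> ALLOWED = Arrays.asList("http", "https", "mailto", "ftp");

    // Resolves href against baseUri (if one is given) and reports whether
    // the resulting URI has a scheme that is on the allow-list.
    public static boolean hasAllowedProtocol(String baseUri, String href) {
        URI resolved = baseUri.isEmpty()
                ? URI.create(href)                      // no base: href stays relative
                : URI.create(baseUri).resolve(href);    // base given: href becomes absolute
        String scheme = resolved.getScheme();           // null for a relative URI
        return scheme != null && ALLOWED.contains(scheme.toLowerCase());
    }

    public static void main(String[] args) {
        // With a base URI the link resolves to http://base.uri/test.xhtml.
        System.out.println(hasAllowedProtocol("http://base.uri", "/test.xhtml")); // true
        // Without a base URI there is no scheme to confirm.
        System.out.println(hasAllowedProtocol("", "/test.xhtml"));                // false
    }
}
```

In this sketch the empty string stands in for a missing base URI, which is roughly the situation the two-argument clean call leaves you in: the link cannot be resolved to an allowed protocol, so the attribute is dropped.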

answered Sep 21 '22 02:09 by luksch
