
Accepting relative paths in JSoup clean for <img> tags

Tags:

jsoup

The following is an example of the text I need to parse.

<P>The symbol <IMG id="pic1" height=15 src="images/itemx/image001.gif" width=18>indicates......</P>

I need to clean this HTML. Applying the following code removes the src attribute because its value doesn't start with a valid protocol. Is there any way to configure jsoup to keep the attribute? I want to avoid using absolute URLs if possible.

Jsoup.clean(content, Whitelist.basicWithImages());
st1 asked Jul 09 '11 14:07


3 Answers

The jsoup cleaner allows relative links as long as a base URI is specified when cleaning. The link is resolved against the base URI so that its protocol can be checked against the allowed protocols. In your example, you're calling the clean method without a base URI, so the link cannot be resolved and is therefore removed.

E.g.:

String clean = Jsoup.clean(html, "http://example.com/", 
   Whitelist.basicWithImages());
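
To illustrate why the base URI matters, here is a JDK-only sketch of the idea. It is not jsoup's actual implementation: the class name RelativeLinkCheck, the method resolveIfAllowed, and the ALLOWED set are made up for this example. It resolves a link against an optional base URI and accepts it only if the result has an allowed protocol.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

public class RelativeLinkCheck {
    // Hypothetical allow-list, chosen to mirror the http/https restriction on img src.
    static final Set<String> ALLOWED = Set.of("http", "https");

    /**
     * Resolves link against baseUri and returns the absolute form,
     * or null when the protocol cannot be confirmed as allowed.
     */
    public static String resolveIfAllowed(String baseUri, String link) {
        URI resolved = baseUri.isEmpty()
                ? URI.create(link)                     // no base: a relative link stays relative
                : URI.create(baseUri).resolve(link);   // base present: the link becomes absolute
        String scheme = resolved.getScheme();          // null for a relative URI
        boolean allowed = scheme != null
                && ALLOWED.contains(scheme.toLowerCase(Locale.ROOT));
        return allowed ? resolved.toString() : null;
    }

    public static void main(String[] args) {
        // Without a base URI the relative link has no protocol, so it is rejected.
        System.out.println(resolveIfAllowed("", "images/itemx/image001.gif"));
        // With a base URI the link resolves to an http URL and passes the check.
        System.out.println(resolveIfAllowed("http://example.com/", "images/itemx/image001.gif"));
    }
}
```

The first call prints null and the second prints http://example.com/images/itemx/image001.gif, which matches the behaviour described above: no base URI means nothing to verify, so the attribute is dropped.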

Note that in the current version, any relative links will be converted to absolute links after cleaning. I've just committed a change (available in the next release) which will optionally allow relative links to be preserved.

Syntax will be:

String clean = Jsoup.clean(html, "http://example.com/",
    Whitelist.basicWithImages().preserveRelativeLinks(true));
Jonathan Hedley answered Oct 16 '22 08:10


Unfortunately, the accepted answer does not work for me, because I have to support multiple domains (including multiple dev environments and multiple production sites). So we really need the relative URLs, regardless of the dangers they bring. Here's what I did:

// Allow relative URLs. jsoup doesn't support that, so we use reflection to
// replace the map of allowed protocols with an empty one, which means all protocols are allowed.
Field field = ReflectionUtils.findField(WHITELIST.getClass(), "protocols");
ReflectionUtils.makeAccessible(field);
ReflectionUtils.setField(field, WHITELIST, Maps.newHashMap());

(ReflectionUtils is a Spring class that simply wraps the checked exceptions thrown by the reflection API.)

Bozho answered Oct 16 '22 09:10


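Removing the allowed protocols for the relevant attributes may help. Once an attribute has no protocols configured, its value is no longer protocol-checked, so relative URLs are kept: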

whitelist.removeProtocols("a", "href", "ftp", "http", "https", "mailto");
whitelist.removeProtocols("img", "src", "http", "https");
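
As a rough illustration of why this works, here is a simplified JDK-only model of per-attribute protocol rules. It is only a sketch of the idea, not jsoup's code: the class ProtocolRules and its isSafe method are hypothetical names invented for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ProtocolRules {
    // Key is "tag/attribute"; value is the set of protocols a URL value must start with.
    private final Map<String, Set<String>> protocols = new HashMap<>();

    public ProtocolRules addProtocols(String tag, String attr, String... prots) {
        protocols.computeIfAbsent(tag + "/" + attr, k -> new HashSet<>())
                 .addAll(Arrays.asList(prots));
        return this;
    }

    public ProtocolRules removeProtocols(String tag, String attr, String... prots) {
        String key = tag + "/" + attr;
        Set<String> allowed = protocols.get(key);
        if (allowed != null) {
            allowed.removeAll(Arrays.asList(prots));
            // Once the last protocol is gone, drop the restriction entirely.
            if (allowed.isEmpty()) protocols.remove(key);
        }
        return this;
    }

    public boolean isSafe(String tag, String attr, String value) {
        Set<String> allowed = protocols.get(tag + "/" + attr);
        if (allowed == null) return true; // no restriction: any value passes, relative or not
        String lower = value.toLowerCase(Locale.ROOT);
        return allowed.stream().anyMatch(p -> lower.startsWith(p + ":"));
    }

    public static void main(String[] args) {
        ProtocolRules rules = new ProtocolRules().addProtocols("img", "src", "http", "https");
        System.out.println(rules.isSafe("img", "src", "images/itemx/image001.gif")); // false
        rules.removeProtocols("img", "src", "http", "https");
        System.out.println(rules.isSafe("img", "src", "images/itemx/image001.gif")); // true
        System.out.println(rules.isSafe("img", "src", "javascript:alert(1)"));       // true
    }
}
```

The last line shows the trade-off: with the restriction removed, values such as javascript: URLs also pass, so this approach only makes sense when that risk is acceptable.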
vasily answered Oct 16 '22 08:10