The following is an example of the text I need to parse.
<P>The symbol <IMG id="pic1" height=15 src="images/itemx/image001.gif" width=18>indicates......</P>
I need to clean up this HTML. Applying the following code removes the src attribute, because its value doesn't start with a valid protocol. Is there any way to configure jsoup to keep the attribute? I want to avoid using absolute URLs if possible.
Jsoup.clean(content, Whitelist.basicWithImages());
The jsoup cleaner will allow relative links, as long as a base URI is specified when cleaning. This is so the link's protocol can be confirmed against the allowed protocols. In your example, you're calling the clean method without a base URI, so the link cannot be resolved and must be removed.
E.g.:
String clean = Jsoup.clean(html, "http://example.com/",
Whitelist.basicWithImages());
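To illustrate why the base URI matters, here is a rough sketch of the idea using only java.net.URI. It is a simplified model, not jsoup's actual implementation, and the class and method names (RelativeLinkCheck, resolve, isAllowed) are made up for this example.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a simplified model of the protocol check, not jsoup's code.
final class RelativeLinkCheck {

    // Resolve a possibly relative link against a base URI.
    // With an empty base, a relative link stays relative and has no scheme.
    static URI resolve(String baseUri, String link) {
        URI target = URI.create(link);
        return baseUri.isEmpty() ? target : URI.create(baseUri).resolve(target);
    }

    // A link passes only if its resolved form carries an allowed scheme.
    static boolean isAllowed(String baseUri, String link, Set<String> allowedSchemes) {
        String scheme = resolve(baseUri, link).getScheme();
        return scheme != null && allowedSchemes.contains(scheme.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> schemes = Set.of("http", "https");
        String src = "images/itemx/image001.gif";
        System.out.println(isAllowed("", src, schemes));                    // false
        System.out.println(isAllowed("http://example.com/", src, schemes)); // true
        System.out.println(resolve("http://example.com/", src));
        // http://example.com/images/itemx/image001.gif
    }
}
```

Without a base, the relative src has no scheme to compare against the allowed list, so it is rejected. With a base, it resolves to an http URL and passes.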
Note that in the current version, any relative links will be converted to absolute links after cleaning. I've just committed a change (available in the next release) which will optionally allow relative links to be preserved.
Syntax will be:
String clean = Jsoup.clean(html, "http://example.com/",
Whitelist.basicWithImages().preserveRelativeLinks(true));
Unfortunately, the accepted answer does not work for me, because I have to support multiple domains (including multiple dev environments and multiple production sites). So we really need the relative URLs (regardless of the dangers they bring). Here's what I did:
// Allow relative URLs. jsoup doesn't support that, so we use reflection to
// replace the map of allowed protocols with an empty one, which means all protocols are allowed.
Field field = ReflectionUtils.findField(WHITELIST.getClass(), "protocols");
ReflectionUtils.makeAccessible(field);
ReflectionUtils.setField(field, WHITELIST, Maps.newHashMap());
(ReflectionUtils is a Spring class that simply wraps the checked exceptions thrown by the reflection API.)
This may be helpful:
whitelist.removeProtocols("a", "href", "ftp", "http", "https", "mailto");
whitelist.removeProtocols("img", "src", "http", "https");