Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to make a Jsoup whitelist to accept certain attribute content

I'm using Jsoup with relaxed whitelist. It seems perfect but I would like to keep the embedded images tags like <img alt="" src="data:;base64.

Is there a way to modify the whitelist to accept also those img?

Edit:

If I use Whitelist.relaxed().addProtocols("img","src","data") then those img tags are not removed. But it accepts anything after "data:" and I would like just to keep them if src content starts with "data:;base64". Is it possible with jsoup?

like image 980
Federico Pugnali Avatar asked Mar 16 '14 22:03

Federico Pugnali


1 Answers

You can extend Whitelist and override isSafeAttribute to perform custom checks. As there's no way to extend Whitelist.relaxed() directly, you'll have to copy some code to set up the same list:

public class RelaxedPlusDataBase64Images extends Whitelist {
    public RelaxedPlusDataBase64Images() {
        //copied from Whitelist.relaxed()
        addTags("a", "b", "blockquote", "br", "caption", "cite", "code", "col",
                "colgroup", "dd", "div", "dl", "dt", "em", "h1", "h2", "h3", "h4", "h5", "h6",
                "i", "img", "li", "ol", "p", "pre", "q", "small", "strike", "strong",
                "sub", "sup", "table", "tbody", "td", "tfoot", "th", "thead", "tr", "u",
                "ul");
        addAttributes("a", "href", "title");
        addAttributes("blockquote", "cite");
        addAttributes("col", "span", "width");
        addAttributes("colgroup", "span", "width");
        addAttributes("img", "align", "alt", "height", "src", "title", "width");
        addAttributes("ol", "start", "type");
        addAttributes("q", "cite");
        addAttributes("table", "summary", "width");
        addAttributes("td", "abbr", "axis", "colspan", "rowspan", "width");
        addAttributes("th", "abbr", "axis", "colspan", "rowspan", "scope", "width");
        addAttributes("ul", "type");
        addProtocols("a", "href", "ftp", "http", "https", "mailto");
        addProtocols("blockquote", "cite", "http", "https");
        addProtocols("cite", "cite", "http", "https");
        addProtocols("img", "src", "http", "https");
        addProtocols("q", "cite", "http", "https");
    }

    @Override
    protected boolean isSafeAttribute(String tagName, Element el, Attribute attr) {
        return ("img".equals(tagName)
                && "src".equals(attr.getKey())
                && attr.getValue().startsWith("data:;base64")) ||
            super.isSafeAttribute(tagName, el, attr);
    }
}

As you haven't provided the code you're using to parse or the HTML you're sanitizing, I haven't tested this.

like image 139
Jeffrey Bosboom Avatar answered Oct 28 '22 11:10

Jeffrey Bosboom