
Jsoup attribute removal on html tags

Tags: java, jsoup

I want to filter certain texts which may contain HTML. I use jsoup to whitelist and clean the tags, which works pretty nicely.

My only problem is that some of the tags can contain attributes, mostly style or class, but there could also be others (name, target, etc.). When cleaning, this is no problem because the attributes get stripped nicely, but when validating against the whitelist, some tags that would be allowed get rejected because of their attributes. The basic whitelist does not seem to cover style or class attributes, and I cannot be sure what else I will encounter.

Since I want to allow quite a wide range of tags, but remove most of them during cleaning, I don't want to add all attributes for all tags that I'm allowing. The simplest would be to strip all attributes from all tags, since I'm not interested in them anyway and then check if the stripped text with the plain tags is valid.

Is there a function that removes all attributes, or some simple loop that does it? Another option would be to tell the whitelist to ignore all attributes and validate on the tags alone.

Xtroce asked Aug 14 '13 14:08





1 Answer

The solution that finally worked for me is quite simple. I iterate through all elements, iterate through each element's attributes, and remove them from the element. That leaves me with a cleaned version in which I only have to validate the HTML tags themselves. This is probably not the neatest way to solve the problem, but it does what I wanted.

** EDIT **

The old code was upvoted many times even though it contained an absolute beginner's bug: you cannot safely remove items from a list while iterating over that same list. The bug only triggered when more than one attribute was removed, however.
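To make the pitfall concrete, here is a minimal sketch using only plain java.util collections. The class RemoveWhileIteratingDemo, its methods and the sample attribute map are hypothetical stand-ins invented for illustration, not jsoup API, and jsoup's own Attributes class may behave differently. It only shows the general pattern: removing inside a for-each loop fails, while copying the keys first does not.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; a plain map stands in for an element's attributes.
class RemoveWhileIteratingDemo {

    // Sample data (assumed for illustration): three attribute key/value pairs.
    static Map<String, String> sampleAttributes() {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("style", "color:red");
        attrs.put("class", "note");
        attrs.put("target", "_blank");
        return attrs;
    }

    // Buggy pattern: removes from the map while a for-each loop iterates over it.
    // Returns true if the loop was aborted by a ConcurrentModificationException.
    static boolean removeWhileIteratingFails(Map<String, String> attrs) {
        try {
            for (String key : attrs.keySet()) {
                attrs.remove(key);
            }
            return false;
        } catch (ConcurrentModificationException ex) {
            return true;
        }
    }

    // Safe pattern: copy the keys into a separate list, then remove.
    static void removeAllSafely(Map<String, String> attrs) {
        List<String> keys = new ArrayList<>(attrs.keySet());
        for (String key : keys) {
            attrs.remove(key);
        }
    }

    public static void main(String[] args) {
        System.out.println("buggy loop failed: "
                + removeWhileIteratingFails(sampleAttributes()));

        Map<String, String> attrs = sampleAttributes();
        removeAllSafely(attrs);
        System.out.println("attributes left: " + attrs.size());
    }
}
```

In this sketch the buggy loop gets through a map with a single entry without complaint and only fails once there are two or more, which matches the behaviour described above.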

Updated code with the bug fix:

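import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist; // called Safelist in newer jsoup releases

// The class and method names are illustrative wrappers so the snippet compiles.
public class AttributeStripper {

    static boolean isValidWithoutAttributes(String aText, Whitelist theLegalWhitelist) {
        Document doc = Jsoup.parseBodyFragment(aText);
        for (Element e : doc.getAllElements()) {
            // Collect the keys first, so that nothing is removed
            // while the attributes are still being iterated.
            List<String> attToRemove = new ArrayList<>();
            for (Attribute a : e.attributes()) {
                attToRemove.add(a.getKey());
            }
            for (String att : attToRemove) {
                e.removeAttr(att);
            }
        }
        // Only the bare tags are left to check against the whitelist.
        return Jsoup.isValid(doc.body().html(), theLegalWhitelist);
    }
}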
Xtroce answered Oct 30 '22 17:10
