I want to filter certain texts which may contain HTML. I use jsoup to whitelist and clean the tags, which works nicely.
My only problem is that some of the tags can carry attributes, mostly style or class, but other attributes (name, target, etc.) can appear as well. When cleaning, this is no problem, because the attributes get stripped. When validating against the whitelist, however, some tags that would otherwise be allowed get rejected because of their attributes. The basic whitelist does not seem to cover style or class attributes, and I cannot be sure what else I will encounter.
Since I want to allow quite a wide range of tags, but remove most of them during cleaning, I don't want to add all attributes for all tags that I'm allowing. The simplest approach would be to strip all attributes from all tags, since I'm not interested in them anyway, and then check whether the stripped text with the plain tags is valid.
Is there a function, or some simple loop, that removes all attributes? Another option would be to tell the whitelister to ignore all attributes and whitelist on the tags alone.
Document docsoup = Jsoup.parse(htmlin);
docsoup.head().remove();
jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
The solution that finally worked for me is quite simple. I iterate through all elements, then through each element's attributes, and remove them from the element. That leaves me with a cleaned version where I only have to validate the HTML tags themselves. I don't think this is the neatest way to solve the problem, but it does what I wanted.
** EDIT **
The old code got upvoted many times even though it contained an absolute beginner's bug: it removed attributes while iterating over that same collection of attributes, which you should never do. The bug only triggered when more than one attribute was removed, however.
Updated code with the bug fix:
Document doc = Jsoup.parseBodyFragment(aText);
Elements el = doc.getAllElements();
for (Element e : el) {
    List<String> attToRemove = new ArrayList<>();
    Attributes at = e.attributes();
    for (Attribute a : at) {
        // Copy the keys into a separate list first, instead of removing
        // during iteration, to be sure ALL attributes will be removed.
        attToRemove.add(a.getKey());
    }
    for (String att : attToRemove) {
        e.removeAttr(att);
    }
}
return Jsoup.isValid(doc.body().html(), theLegalWhitelist);
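The bug mentioned in the edit can be reproduced without jsoup. The following is only a minimal sketch, assuming a plain LinkedHashMap as a stand-in for an element's attributes (it is not jsoup's actual Attributes class). The class and method names are made up for illustration. It shows the remove-while-iterating pattern failing once there is more than one entry, and the copy-the-keys-first pattern that avoids it.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; the map only stands in for element attributes.
class AttributeRemovalSketch {

    // Buggy pattern: removes entries from the map while a for-each loop
    // is still iterating over the same map. Returns true if the loop
    // failed with a ConcurrentModificationException.
    static boolean removeWhileIteratingFails(Map<String, String> attrs) {
        try {
            for (String key : attrs.keySet()) {
                attrs.remove(key); // modifies the map during iteration
            }
            return false;
        } catch (ConcurrentModificationException ex) {
            return true;
        }
    }

    // Fixed pattern: take a snapshot of the keys, then remove by key.
    static void removeAllSafely(Map<String, String> attrs) {
        List<String> keys = new ArrayList<>(attrs.keySet());
        for (String key : keys) {
            attrs.remove(key);
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("class", "intro");
        attrs.put("style", "color:red");

        System.out.println("buggy loop failed: " + removeWhileIteratingFails(attrs));
        removeAllSafely(attrs);
        System.out.println("attributes left: " + attrs.size());
    }
}
```

With a single entry the buggy loop finishes before the iterator gets a chance to notice the modification, which matches the observation that the bug only triggered when more than one attribute was removed.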