
How to remove all inline styles and other attributes from html elements using Jsoup?

How can I remove all inline styles and other attributes (class, onclick) from HTML elements using Jsoup?

Sample input:

<div style="padding-top:25px;" onclick="javascript:alert('hi');">
This is a sample div <span class='sampleclass'> This is a sample span </span>
</div>

Sample output:

<div>This is a sample div <span> This is a sample span </span> </div>

My code (is this the right way, or is there a better approach?):

Document doc = Jsoup.parse(html);
Elements el = doc.getAllElements();
for (Element e : el) {
    Attributes at = e.attributes();
    for (Attribute a : at) {    
        e.removeAttr(a.getKey());    
    }
}

vjy asked Dec 19 '22 22:12



1 Answer

Yes, one approach is indeed to iterate through the elements and call removeAttr() for each attribute.
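One caveat with that loop: it removes attributes from the same collection it is iterating over, which, depending on the jsoup version, may skip entries or fail at runtime. A minimal sketch of a safer variant is below; it copies the attribute keys first and removes them afterwards. The class name StripAttributes and the helper stripAllAttributes are made up for this illustration, and parseBodyFragment is used on the assumption that the input is a fragment rather than a full page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
class StripAttributes {

    // Removes every attribute from every element in the given HTML fragment
    // and returns the resulting body HTML.
    static String stripAllAttributes(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        for (Element e : doc.getAllElements()) {
            // Snapshot the keys first, so nothing is removed from the
            // attribute collection while it is being iterated.
            List<String> keys = new ArrayList<>();
            for (Attribute a : e.attributes()) {
                keys.add(a.getKey());
            }
            for (String key : keys) {
                e.removeAttr(key);
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div style=\"padding-top:25px;\" onclick=\"javascript:alert('hi');\">"
                + "This is a sample div <span class='sampleclass'> This is a sample span </span></div>";
        System.out.println(stripAllAttributes(html));
    }
}
```

Unlike the cleaning approach, this keeps every tag in the document and only drops the attributes, so it suits cases where the markup itself is trusted.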

An alternative is jsoup's Whitelist class (see docs). Passed to the Jsoup.clean() method, it removes every tag and attribute that is not explicitly allowed.

For example:

String html = "<html><head></head><body><div style='padding-top:25px;' onclick=\"javascript:alert('hi');\">This is a sample div <span class='sampleclass'>This is a simple span</span></div></body></html>";

// simpleText() allows only basic text formatting tags (b, em, i, strong, u) and no attributes
Whitelist wl = Whitelist.simpleText();
wl.addTags("div", "span"); // add additional tags here as necessary
String clean = Jsoup.clean(html, wl); // returns the cleaned body HTML
System.out.println(clean);

This produces the following output:

11-05 19:56:39.302: I/System.out(414): <div>
11-05 19:56:39.302: I/System.out(414):  This is a sample div 
11-05 19:56:39.302: I/System.out(414):  <span>This is a simple span</span>
11-05 19:56:39.302: I/System.out(414): </div>
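
A note for newer jsoup releases: as far as I know, Whitelist was renamed to Safelist in jsoup 1.14.1 and the old class was later removed, so the snippet above may not compile against a current jar. Below is a rough equivalent, assuming a jsoup version that ships org.jsoup.safety.Safelist. The class name SafelistClean and the helper keepTagsOnly are hypothetical names for this sketch. It starts from Safelist.none() instead of simpleText(), so only the tags you list survive.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, not part of jsoup.
class SafelistClean {

    // Keeps only the listed tags and drops every attribute, because the
    // safelist built here allows no attributes at all.
    static String keepTagsOnly(String html, String... tags) {
        Safelist safelist = Safelist.none().addTags(tags);
        return Jsoup.clean(html, safelist);
    }

    public static void main(String[] args) {
        String html = "<div style='padding-top:25px;' onclick=\"javascript:alert('hi');\">"
                + "This is a sample div <span class='sampleclass'>This is a simple span</span></div>";
        System.out.println(keepTagsOnly(html, "div", "span"));
    }
}
```

Tags that are not listed are removed but their text content is kept, so this is closer to sanitizing than to a pure attribute strip.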

ashatte answered May 06 '23 07:05
