
Removing HTML tags except a few specific ones from a String in Java




My input is a plain text string, and the requirement is to remove all HTML tags except a few specific ones.


If the allowed tags have attributes such as class or id, I want those attributes removed.

A few examples:

<a href = "#">Link</a>            ->   Link

<p>paragraph</p>                  ->   <p>paragraph</p>

<p class="class1">paragraph</p>   ->   <p>paragraph</p>

I have gone through "Remove HTML tags from a String", but it does not answer my question completely.

Can this be handled by a set of regexes, or should I use a library?

RandomQuestion asked Dec 02 '22 02:12


2 Answers

I tried jsoup, and it seems to handle all such cases. Here is some example code.

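// Needs jsoup and Apache Commons Lang (the lang3 package is assumed here).
// Newer jsoup releases rename Whitelist to Safelist.
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public String clean(String unsafe) {
    // Allow only these tags; Whitelist.none() permits no attributes.
    Whitelist whitelist = Whitelist.none().addTags("p", "br", "ul");
    String safe = Jsoup.clean(unsafe, whitelist);
    return StringEscapeUtils.unescapeXml(safe); // undo jsoup's entity escaping
}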

For the input string

String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";

I get the following output, which is pretty much what I require.

<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>
RandomQuestion answered Dec 09 '22 13:12


For simple HTML, this may be sufficient:

// remove any <script> blocks, including ones that span several lines
html = html.replaceAll("(?is)<script.*?</script>", "");
// remove any attributes, keeping only the tag name
html = html.replaceAll("<([a-zA-Z0-9_-]+)\\s[^>]*>", "<$1>");
// remove every tag except li and p (the \b stops <pre> or <link> from being kept)
html = html.replaceAll("(?i)<(?!/?(li|p)\\b)[^>]*>", "");
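
To try the three replacements against the examples from the question, they can be wrapped in a small class. This is only a sketch using the standard library: the class name TagFilter and the method name clean are made up for illustration, and the patterns assume that only li and p should survive, that script blocks may span several lines, and that a tag such as <pre> must not be mistaken for <p>.

```java
import java.util.regex.Pattern;

// Hypothetical helper: keeps <p> and <li> (without attributes) and drops every other tag.
class TagFilter {
    // (?is): case-insensitive, and '.' also matches newlines inside <script> bodies.
    private static final Pattern SCRIPT = Pattern.compile("(?is)<script.*?</script>");
    // A tag name followed by whitespace and attributes; only the name is kept.
    private static final Pattern ATTRIBUTES = Pattern.compile("<([a-zA-Z0-9_-]+)\\s[^>]*>");
    // Any tag whose name is not exactly li or p; \b rejects longer names such as pre or link.
    private static final Pattern OTHER_TAGS = Pattern.compile("(?i)<(?!/?(?:li|p)\\b)[^>]*>");

    static String clean(String html) {
        String result = SCRIPT.matcher(html).replaceAll("");
        result = ATTRIBUTES.matcher(result).replaceAll("<$1>");
        return OTHER_TAGS.matcher(result).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("<a href = \"#\">Link</a>"));           // Link
        System.out.println(clean("<p>paragraph</p>"));                   // <p>paragraph</p>
        System.out.println(clean("<p class=\"class1\">paragraph</p>"));  // <p>paragraph</p>
    }
}
```

One caveat with this approach: anything between a < and the next > is treated as a tag, so text like "< this is not html >" is removed rather than kept. If that matters, a real HTML parser is the safer choice.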

Hope that helps.

beny23 answered Dec 09 '22 12:12
