Removing HTML tags except a few specific ones from a String in Java

Question

My input is a plain-text string, and the requirement is to remove all HTML tags except a few specific ones, such as:

<p>
<li>
<u>
<li>

If these specific tags have attributes like class or id, I want to remove these attributes.

A few examples:

<a href = "#">Link</a>            ->   Link

<p>paragraph</p>                  ->   <p>paragraph</p>

<p class="class1">paragraph</p>   ->   <p>paragraph</p>

I have gone through "Remove HTML tags from a String", but it does not answer my question completely.

Can this be handled by a set of regexes, or could I make use of some library?

RandomQuestion · Accepted Answer

I tried jsoup, and it seems to be able to handle all such cases. Here is some example code.

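 import org.apache.commons.lang3.StringEscapeUtils;
 import org.jsoup.Jsoup;
 import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist

 public String clean(String unsafe) {
     // Start from an empty whitelist, then allow only these tags (no attributes).
     Whitelist whitelist = Whitelist.none();
     whitelist.addTags("p", "br", "ul");

     // jsoup escapes the leftover text; unescape it to get the raw characters back.
     String safe = Jsoup.clean(unsafe, whitelist);
     return StringEscapeUtils.unescapeXml(safe);
 }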

For the input string

String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";

I get the following output, which is pretty much what I require.

<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>

beny23 · Answer

For simple HTML, this may be sufficient:

// remove any <script> elements together with their content ((?s) lets . span newlines)
html = html.replaceAll("(?is)<script.*?</script>", "");
// this removes any attributes
html = html.replaceAll("(?i)<([a-zA-Z0-9_-]*)(\\s[^>]*)>", "<$1>");
// this removes any tags (not li and p); \\b stops <pre> or <link> from slipping through
html = html.replaceAll("(?i)<(?!/?(li|p)\\b)[^>]*>", "");

Hope that helps.
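
The same idea can also be done in a single pass with a configurable set of allowed tags. The following is only a minimal sketch using java.util.regex: the class name TagFilter and the method keepOnly are made up for illustration, Set.of assumes Java 9 or later, and it does not drop the content of <script> elements or cope with comments, CDATA, or a ">" inside an attribute value.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
    // Matches an opening or closing tag: group 1 = optional "/", group 2 = tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    // Hypothetical helper: keeps only the tags in `allowed` and strips their attributes.
    public static String keepOnly(String html, Set<String> allowed) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one unchanged.
            out.append(html, last, m.start());
            String name = m.group(2).toLowerCase();
            if (allowed.contains(name)) {
                // Re-emit the tag without any attributes.
                out.append('<').append(m.group(1)).append(name).append('>');
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("p", "li", "u");
        System.out.println(keepOnly("<a href = \"#\">Link</a>", allowed));
        System.out.println(keepOnly("<p class=\"class1\">paragraph</p>", allowed));
    }
}
```

With the examples from the question, the first call drops the <a> tags and keeps the text "Link", and the second keeps the <p> element but loses its class attribute. For anything beyond simple, well-formed markup, a real HTML parser such as jsoup is likely the safer choice.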
