
Removing Html tags except few specific ones from String in java

Tags: java, html

My input is a plain text string, and the requirement is to remove all HTML tags except a few specific ones, such as:

<p>
<li>
<u>
<li>

If these specific tags have attributes like class or id, I want to remove these attributes.

A few examples:

<a href = "#">Link</a>            ->   Link

<p>paragraph</p>                  ->   <p>paragraph</p>

<p class="class1">paragraph</p>   ->   <p>paragraph</p>

I have gone through "Remove HTML tags from a String", but it does not answer my question completely.

Can this be handled by a set of regexes, or should I use a library?

RandomQuestion asked Dec 02 '22 02:12


2 Answers

I tried jsoup, and it seems to handle all such cases. Here is some example code.

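 // Assumed imports (the original post does not show them):
 // import org.jsoup.Jsoup;
 // import org.jsoup.safety.Whitelist;   // newer jsoup releases name this class Safelist
 // import org.apache.commons.lang3.StringEscapeUtils;
 public String clean(String unsafe){
        // Start from an empty whitelist and allow only these tags; no attributes are allowed.
        Whitelist whitelist = Whitelist.none();
        whitelist.addTags("p", "br", "ul");

        // jsoup escapes stray '<' and '>' as entities; unescaping restores them (and any escaped markup).
        String safe = Jsoup.clean(unsafe, whitelist);
        return StringEscapeUtils.unescapeXml(safe);
 }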

For the input string

String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";

I get the following output, which is pretty much what I require.

<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>
RandomQuestion answered Dec 09 '22 13:12


For simple HTML, this may be sufficient:

// remove any <script> blocks, including their contents ((?s) lets '.' span line breaks)
html = html.replaceAll("(?is)<script.*?</script>", "");
// remove any attributes, keeping only the tag name
html = html.replaceAll("(?i)<([a-zA-Z0-9_-]*)(\\s[^>]*)>", "<$1>");
// remove any tags other than li and p (\\b stops <pre> or <link> from slipping through)
html = html.replaceAll("(?i)<(?!/?(li|p)\\b)[^>]*>", "");

Hope that helps.
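
The three replacements above can be bundled into one small, self-contained class. This is only a sketch under a few assumptions: the class name `TagStripper` and the method `strip` are made up for illustration, the allowed tags (`p`, `li`, `u`) are taken from the question, and the input is assumed to be simple HTML with no `>` inside attribute values. Regex-based stripping is fragile on real-world markup, so a parser is likely the safer choice there.

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Script blocks are dropped together with their contents.
    private static final Pattern SCRIPT =
            Pattern.compile("<script\\b.*?</script>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // An opening tag that carries attributes: only the tag name is kept.
    private static final Pattern ATTRIBUTES =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\s[^>]*>");

    // Any tag whose name is not p, li or u (assumed allow-list, taken from the question).
    // Requiring a letter after '<' or '</' leaves text like "< not html >" untouched.
    private static final Pattern OTHER_TAGS =
            Pattern.compile("</?(?!(?:p|li|u)\\b)[a-zA-Z][^>]*>", Pattern.CASE_INSENSITIVE);

    public static String strip(String html) {
        String result = SCRIPT.matcher(html).replaceAll("");
        result = ATTRIBUTES.matcher(result).replaceAll("<$1>");
        return OTHER_TAGS.matcher(result).replaceAll("");
    }

    public static void main(String[] args) {
        // The examples from the question.
        System.out.println(strip("<a href = \"#\">Link</a>"));
        System.out.println(strip("<p class=\"class1\">paragraph</p>"));
    }
}
```

Compiling the patterns once avoids recompiling them on every call, which `String.replaceAll` would otherwise do.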

beny23 answered Dec 09 '22 12:12