
Remove Some HTML tags with RegExp and Java

Tags: java, html, regex

I want to remove HTML tags from a String. That part is easy, and I already do it like this:

// Replaces every HTML tag with a space, then collapses runs of whitespace.
public String removerTags(String html) {
    return html.replaceAll("</?[^>]+>", " ").replaceAll("\\s+", " ").trim();
}

The problem is that I do not want to remove all the tags. I want the tag

<span style="background-color: yellow">(text)</span>

to stay intact in the string.

I use this span as a kind of "highlight" in the search results of a web application I am building with GWT.

I need to do this because, if the search finds text that contains an HTML tag (the indexing is done by Lucene) and that tag is broken, appendHTML from SafeHtmlBuilder is unable to build the String.

Is there a reasonably good way to do this?

Hugs.

caarlos0 asked Sep 08 '11 11:09


1 Answer

I strongly suggest you use JSoup for this task. In my opinion, regular expressions simply aren't well suited to processing HTML. With JSoup this is basically a simple, readable, and easily maintainable one-liner!

Have a look at the Jsoup.clean method, and perhaps this article:

  • Sanitize Untrusted HTML
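
For comparison, if adding a library is not an option, the regex from the question can be narrowed with a negative lookahead so that span tags survive. This is only a rough sketch using the standard library, not the approach recommended above. The class and method names (TagStripper, removeTagsExceptSpan) are made up for illustration. It keeps every span tag whatever its attributes are, and it assumes well-formed tags, so a closing tag written as "</ span>" or a tag cut off mid-way will not be handled correctly.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class TagStripper {
    // Matches any tag except <span ...> and </span>. The negative lookahead
    // (?!/?span\b) refuses to match when the tag name is exactly "span".
    private static final Pattern NON_SPAN_TAG =
            Pattern.compile("<(?!/?span\\b)[^>]+>", Pattern.CASE_INSENSITIVE);

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String removeTagsExceptSpan(String html) {
        // Replace each non-span tag with a space so adjacent words do not fuse.
        String withoutTags = NON_SPAN_TAG.matcher(html).replaceAll(" ");
        // Collapse the whitespace left behind and trim both ends.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Some <b>bold</b> and "
                + "<span style=\"background-color: yellow\">highlighted</span> text</p>";
        System.out.println(removeTagsExceptSpan(html));
    }
}
```

Because it accepts any span, this does not sanitize untrusted input the way a whitelist-based cleaner does. It only addresses the narrow case of keeping the highlight markup while stripping everything else.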
aioobe answered Sep 20 '22 21:09