
How to remove all html tags except img?

I have some HTML text that contains all kinds of tags, such as <table>, <a>, <img>, and so on.

Now I want to use a regular expression to remove all the HTML tags except <img ...> and </img> (and the upper-case <IMG> and </IMG>).

How can I do this?


UPDATE:

My task is very simple: it just prints the text content (including images) of an HTML page as a summary on the front page, so I think a regular expression is good and simple enough.


UPDATE AGAIN

Maybe a sample will make my question easier to understand :)

Here is some HTML text:

<html>
  <head></head>
  <body>
     Hello, everyone. Here is my photo: <img src="xxx.jpg" />. 
     And, <a href="xxx">know more</a> about me!
  </body>
</html>

I want to keep <img>, and remove the other tags. This is the result I want:

Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!

My current code is:

html.replaceAll("<.*?>", "")

But this removes everything between < and >, including the image tags. I want to keep <img xxx> and </img>, and remove only the other tags.

Thanks, everyone!

Freewind asked Dec 02 '22 05:12


1 Answer

After a lot of trial and error, this regular expression seems to work for me:

(?i)<(?!img|/img).*?>

My code is:

html.replaceAll('(?i)<(?!img|/img).*?>', '');
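
For reference, here is a minimal, self-contained Java sketch of the same negative-lookahead idea. The class name ImgTagFilter and the method stripTagsExceptImg are hypothetical names chosen for illustration. The pattern is a slightly stricter variant of the one above, not the exact expression: it adds \b so that a tag whose name merely starts with "img" is still removed, and it uses [^>]* instead of .*? so that tags spanning several lines are matched too. It also collapses the leftover whitespace, which I am assuming is wanted because the expected output in the question is on a single line.

```java
import java.util.regex.Pattern;

final class ImgTagFilter {
    // Any tag whose name is not exactly "img" (case-insensitive).
    // (?!/?img\b) rejects <img ...> and </img>; [^>]* also matches newlines.
    private static final Pattern NON_IMG_TAG =
            Pattern.compile("(?i)<(?!/?img\\b)[^>]*>");

    // Runs of whitespace left behind once the tags are gone.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTagsExceptImg(String html) {
        String withoutTags = NON_IMG_TAG.matcher(html).replaceAll("");
        // Assumption: the summary should be a single line of text.
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html>\n  <head></head>\n  <body>\n"
                + "     Hello, everyone. Here is my photo: <img src=\"xxx.jpg\" />. \n"
                + "     And, <a href=\"xxx\">know more</a> about me!\n"
                + "  </body>\n</html>";
        // Prints: Hello, everyone. Here is my photo: <img src="xxx.jpg" />. And, know more about me!
        System.out.println(stripTagsExceptImg(html));
    }
}
```

This is only good enough for simple summaries like the one in the question. A regular expression cannot handle a ">" inside an attribute value or comment, and the text inside <script> or <style> elements is kept as plain text, so for arbitrary HTML a real HTML parser is the safer choice.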
Freewind answered Dec 17 '22 20:12