Regular expression to remove HTML tags

I am using the following regular expression to remove HTML tags from a string. It works, except that the closing tag is left behind: if I try to remove <a href="blah">blah</a>, the </a> remains.

I do not know regular expression syntax at all and fumbled my way through this. Can someone with regex knowledge please provide a pattern that will work?

Here is my code:

    string sPattern = @"<\/?!?(img|a)[^>]*>";
    Regex rgx = new Regex(sPattern);
    Match m = rgx.Match(sSummary);
    string sResult = "";
    if (m.Success)
        sResult = rgx.Replace(sSummary, "", 1);

I am looking to remove the first occurrence of the <a> and <img> tags.

LilMoke asked Sep 24 '10 20:09


2 Answers

Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.

Here is a link to a blog post I wrote a while back that goes into more detail about this problem.

  • http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
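One concrete pitfall, as a minimal sketch in JavaScript (the language of the other answer on this page): the common "anything between < and >" pattern stops at the first > it sees, even when that character sits inside a quoted attribute value. The helper name naiveStrip and the sample inputs are made up for illustration.

```javascript
// Hypothetical helper using the naive "anything between < and >" pattern.
function naiveStrip(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Well-behaved markup is handled as expected.
const ok = naiveStrip('<a href="blah">blah</a>');
// ok === 'blah'

// A '>' inside a quoted attribute value ends the match too early,
// so part of the attribute leaks into the result.
const broken = naiveStrip('<a title="1 > 0">link</a>');
// broken === ' 0">link'

console.log(JSON.stringify(ok), JSON.stringify(broken));
```

A real HTML parser tracks quoting and nesting, which is why it copes with input like the second case.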

That being said, here's a solution that should fix this particular problem. It is by no means a perfect solution, though.

    var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
    var regex = new Regex(pattern);
    var m = regex.Match(sSummary);
    if ( m.Success ) {
        // The named group holds the text between the opening tag and the next '<'.
        sResult = m.Groups["content"].Value;
    }
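If the goal is to keep the rest of the string and drop only the first <a> or <img> tag (plus the closing </a> that goes with it), one possible approach is sketched below in JavaScript. This is an illustration under simple assumptions, not the answer's method: the helper name removeFirstTag is hypothetical, and it assumes no nested links and no > inside attribute values.

```javascript
// Hypothetical helper: removes the first <a ...> or <img ...> tag and,
// for <a>, the first </a> that follows it. The text in between is kept.
function removeFirstTag(html) {
  // \b stops "a" from also matching tags such as <abbr> or <address>.
  const open = /<(img|a)\b[^>]*>/i;
  const m = open.exec(html);
  if (!m) return html; // nothing to remove

  const before = html.slice(0, m.index);
  let after = html.slice(m.index + m[0].length);

  // <img> has no closing tag; only <a> needs its </a> removed.
  if (m[1].toLowerCase() === 'a') {
    after = after.replace(/<\/a\s*>/i, '');
  }
  return before + after;
}

const link = removeFirstTag('see <a href="blah">blah</a> now');
// link === 'see blah now'

const image = removeFirstTag('<img src="x.png"> text <a href="y">y</a>');
// image === ' text <a href="y">y</a>'

console.log(link, image);
```

Handling the opening and closing tags in two separate steps avoids relying on a single pattern to pair them up, which is where the original pattern with a replace count of 1 falls short.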
JaredPar answered Sep 24 '22 20:09


To turn this:

'<td>mamma</td><td><strong>papa</strong></td>' 

into this:

'mamma papa' 

You need to replace the tags with spaces:

.replace(/<[^>]*>/g, ' ') 

and reduce any duplicate spaces into single spaces:

.replace(/\s{2,}/g, ' ') 

then trim away leading and trailing spaces with:

.trim(); 

That means your tag-removal function looks like this:

    function removeTags(string){
      return string.replace(/<[^>]*>/g, ' ')
                   .replace(/\s{2,}/g, ' ')
                   .trim();
    }
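A short usage sketch may help show what the three steps do in practice, including one side effect of replacing tags with spaces. The function body is the same as above with the parameter renamed to html; the second sample input is made up for illustration.

```javascript
// Same logic as the function above; only the parameter name differs.
function removeTags(html) {
  return html.replace(/<[^>]*>/g, ' ')   // every tag becomes a space
             .replace(/\s{2,}/g, ' ')    // collapse runs of whitespace
             .trim();                    // drop leading/trailing whitespace
}

const cells = removeTags('<td>mamma</td><td><strong>papa</strong></td>');
// cells === 'mamma papa'

// Side effect: inline tags inside a word split the word apart.
const inline = removeTags('un<b>believ</b>able');
// inline === 'un believ able'

console.log(cells, '|', inline);
```

Replacing tags with a space keeps table cells and block elements from running together, at the cost of splitting words that contain inline markup.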
Johs answered Sep 25 '22 20:09