Regular expression to remove HTML tags


I am using the following regular expression to remove HTML tags from a string. It works, except that it leaves the closing tag behind. If I attempt to remove <a href="blah">blah</a>, it leaves the </a>.

I do not know regular expression syntax at all and fumbled through this. Can someone with regex knowledge please provide a pattern that will work?

Here is my code:

  string sPattern = @"<\/?!?(img|a)[^>]*>";
  Regex rgx = new Regex(sPattern);
  Match m = rgx.Match(sSummary);
  string sResult = "";
  if (m.Success)
    sResult = rgx.Replace(sSummary, "", 1);

I am looking to remove the first occurrence of the <a> and <img> tags.

2 Answers

Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.
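One concrete pitfall can be sketched in a few lines. The snippet below is in JavaScript rather than C#, and the helper name `stripTagsNaive` and both sample strings are made up for illustration. It shows that a pattern built on `[^>]*` stops at the first `>` it sees, even when that `>` sits inside a quoted attribute value, which HTML allows.

```javascript
// Naive tag stripper, used here only to illustrate a failure mode.
// The name and the sample inputs are hypothetical.
function stripTagsNaive(html) {
  return html.replace(/<[^>]*>/g, '');
}

const simple = '<a href="blah">blah</a>';
const tricky = '<a title="x > y">link</a>';

console.log(JSON.stringify(stripTagsNaive(simple))); // "blah"
// The first match ends at the ">" inside the title attribute,
// so part of the attribute leaks into the output.
console.log(JSON.stringify(stripTagsNaive(tricky))); // " y\">link"
```

A real parser tracks quoting and nesting, which is why it handles inputs like the second one where a pattern of this kind does not.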

Here is a link to a blog post I wrote a while back that goes into more detail about this problem.

  • http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

That being said, here's a solution that should fix this particular problem. It is by no means a perfect solution, though.

var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) {
   // The named group holds the text between the opening tag and the next '<'.
   sResult = m.Groups["content"].Value;
}

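
As a rough illustration of why the closing tag survives in the question's code: the pattern matches opening and closing tags as separate matches, and the replace call is limited to one replacement, so only the opening tag goes away. The sketch below reproduces that in JavaScript (not the C# of the question), where a regex without the `g` flag also replaces only the first match. The `wholeElement` pattern and the `removeFirstTag` helper are assumptions made for this example, not part of the original answer, and the sample string is invented.

```javascript
// The question's pattern, written as a JavaScript regex literal.
const original = /<\/?!?(img|a)[^>]*>/;

// Assumed variant: match an <a> element (opening tag, text, closing tag)
// as one unit, or a standalone <img> tag. \b keeps "<a" from also
// matching tags such as <abbr>.
const wholeElement = /<a\b[^>]*>([^<]*)<\/a>|<img\b[^>]*>/;

// Hypothetical helper: removes the first <a>...</a> pair (keeping its text)
// or the first <img>, whichever comes first in the string.
function removeFirstTag(html) {
  // No g flag, so only the first match is replaced.
  // $1 is the link text; it expands to an empty string for an <img> match.
  return html.replace(wholeElement, '$1');
}

const sSummary = 'See <a href="blah">blah</a> and <a href="x">more</a>.';

console.log(sSummary.replace(original, '')); // See blah</a> and <a href="x">more</a>.
console.log(removeFirstTag(sSummary));       // See blah and <a href="x">more</a>.
```

This variant still assumes the link contains plain text only; a link that wraps other tags will not match, which is one more reason to prefer a parser.
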
To turn this:


into this:

'mamma papa' 

You need to replace the tags with spaces:

.replace(/<[^>]*>/g, ' ') 

and reduce any duplicate spaces into single spaces:

.replace(/\s{2,}/g, ' ') 

then trim away leading and trailing spaces with:

.trim()

So your tag-removal function looks like this:

function removeTags(string){
  return string.replace(/<[^>]*>/g, ' ')
               .replace(/\s{2,}/g, ' ')
               .trim();
}
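
A quick way to try the function is shown below. The input string is a made-up example chosen to produce the 'mamma papa' output mentioned above; it is not the input from the original answer.

```javascript
function removeTags(string) {
  return string.replace(/<[^>]*>/g, ' ')   // each tag becomes a space
               .replace(/\s{2,}/g, ' ')    // runs of whitespace collapse to one space
               .trim();                    // drop leading and trailing whitespace
}

// Hypothetical input, assumed for illustration only.
const input = '<p>mamma</p><p>papa</p>';

console.log(removeTags(input)); // mamma papa
```

Replacing tags with a space rather than an empty string is what keeps adjacent words from running together as 'mammapapa'.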
