Regular Expression for Extracting Script Tags

Tags: c#, regex

I am trying to write a regular expression in C# to remove all script tags and anything contained within them.

So far I have come up with the following, but it does not work:

\<([^:]*?:)?script\>[^(\</<([^:]*?:)?script\>)]*?\</script\>

I'll break it up and explain my thinking in each section:

\<([^:]*?:)?script\>

Here I am trying to state that it should get any script element, even if it is prefixed with a namespace, say, <a:script></a:script>. I have also added this to the closing tag.

[^(\</<([^:]*?:)?script\>)]*?

Here I am trying to state that it should allow anything to be contained within the tags except for </a:script>, </script>, etc.

\</script\>

Here I am stating that it should have a closing tag.

Can anyone spot where I am going wrong?

TheBoss asked Jan 13 '11 17:01


2 Answers

This regular expression does the trick just fine:

\<(?:[^:]+:)?script\>.*?\<\/(?:[^:]+:)?script\>
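As for where the original attempt goes wrong: [^(...)] is a character class, so it excludes the individual characters listed inside the brackets rather than the closing-tag sequence as a whole. Below is a minimal sketch of how the pattern above might be applied in C#. The class and method names are illustrative only, and two options are assumptions of mine rather than part of the pattern: RegexOptions.Singleline, because "." does not match newlines by default, and RegexOptions.IgnoreCase, so that <SCRIPT> is also caught. Note that the pattern requires ">" immediately after "script", so an opening tag with attributes such as <script type="text/javascript"> is not matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Pattern taken unchanged from the answer above.
    private const string Pattern =
        @"\<(?:[^:]+:)?script\>.*?\<\/(?:[^:]+:)?script\>";

    // Removes every match of the pattern from the input.
    // Singleline lets "." span line breaks inside a script body;
    // IgnoreCase is an added assumption to catch <SCRIPT> as well.
    public static string RemoveScripts(string html)
    {
        return Regex.Replace(
            html,
            Pattern,
            string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

Calling ScriptStripper.RemoveScripts on "x<a:script>\nfoo();\n</a:script>y" should leave just "xy".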

But please don't do it.

You will run into a problem with HTML as simple as this:

<script>
var s = "<script></script>";
</script>
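To make the failure concrete, here is a small sketch that runs the pattern over exactly this snippet. The class and method names are hypothetical, and RegexOptions.Singleline is assumed so that "." can cross the line breaks. The lazy ".*?" stops at the first closing tag it sees, which here sits inside the string literal, so the tail of the script is left behind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedScriptDemo
{
    // Same pattern as in the answer, applied with Singleline (an assumption).
    public static string StripWithRegex(string html)
    {
        const string pattern =
            @"\<(?:[^:]+:)?script\>.*?\<\/(?:[^:]+:)?script\>";
        return Regex.Replace(html, pattern, string.Empty, RegexOptions.Singleline);
    }

    // Feeds the three-line snippet from the answer through the regex.
    public static string Run()
    {
        string html = "<script>\nvar s = \"<script></script>\";\n</script>";
        // The match ends at the </script> inside the string literal,
        // so the quote, semicolon, newline and real closing tag remain.
        return StripWithRegex(html);
    }
}
```

NestedScriptDemo.Run() returns the leftover fragment starting at the closing quote of the string literal and ending with the outer closing tag, rather than an empty string.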

How are you going to solve this problem? It is smarter to use the HTML Agility Pack for such things.
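For comparison, a rough sketch of the HTML Agility Pack route might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names are made up for illustration, and the XPath "//script" only targets plain script elements, so namespaced variants such as a:script would need separate handling that is not shown here.

```csharp
using HtmlAgilityPack;

public static class AgilityPackScriptStripper
{
    // Parses the markup, drops every <script> element, and
    // returns the remaining document as a string.
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            foreach (var node in scripts)
            {
                // Detaches the element, including its contents, from the tree.
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The point of this design is that the parser, not a pattern, decides where each script element begins and ends.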

Robert Koritnik answered Nov 05 '22 10:11


You can't parse HTML with regular expressions.

Use the HTML Agility Pack instead.

Tim Robinson answered Nov 05 '22 10:11