I am trying to write a regular expression in C# to remove all script tags and anything contained within them.
So far I have come up with the following:
\<([^:]*?:)?script\>[^(\</<([^:]*?:)?script\>)]*?\</script\>
However, this does not work.
I'll break it up and explain my thinking in each section:
\<([^:]*?:)?script\>
Here I am trying to state that it should get any script element, even if it is prefixed with a namespace, say, <a:script></a:script>.
I have also added this to the closing tag.
[^(\</<([^:]*?:)?script\>)]*?
Here I am trying to state that it should allow anything to be contained within the tags except for </a:script>, </script>, etc.
\</script\>
Here I am stating that it should have a closing tag.
Can anyone spot where I am going wrong?
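This regular expression does the trick, provided it is used with RegexOptions.Singleline so that . also matches newlines (the prefix part excludes whitespace and angle brackets so it cannot run past the end of a tag):
\<(?:[^\s<>:]+:)?script\>.*?\<\/(?:[^\s<>:]+:)?script\>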
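As for why the attempt in the question fails: square brackets define a character class, which matches a single character from a set. It cannot exclude a whole sequence such as </script>, and parentheses inside it are treated as literal characters. Below is a minimal C# sketch of applying the pattern. The names ScriptStripper and RemoveScripts are made up for illustration, and the options are assumptions: Singleline lets . match newlines, and IgnoreCase also catches <SCRIPT>. It uses the pattern with the namespace prefix limited to [^\s<>:]+ (a plain [^:]+ can run past the end of a tag and swallow preceding markup). Note that it only matches opening tags without attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Optional "prefix:" followed by "script", then the shortest run of
    // any characters up to the next closing script tag.
    private static readonly Regex ScriptBlock = new Regex(
        @"\<(?:[^\s<>:]+:)?script\>.*?\<\/(?:[^\s<>:]+:)?script\>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the input with every matched script element removed.
    public static string RemoveScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

Calling ScriptStripper.RemoveScripts on a string that contains a multi-line <script> block and an <a:script> block should remove both and leave the surrounding markup in place.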
You will run into a problem with HTML as simple as this:
<script>
var s = "<script></script>";
</script>
How are you going to solve this problem? It is smarter to use the HTML Agility Pack for such things.
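To make the failure concrete, here is a small C# sketch. It uses a simplified pattern without the namespace part, and the names NestedScriptDemo and StripWithRegex are hypothetical. Because the lazy .*? stops at the first </script> it finds, the match ends inside the string literal, and the tail of the script is left behind in the output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedScriptDemo
{
    // Simplified pattern (no namespace prefix), for illustration only.
    public static string StripWithRegex(string html)
    {
        return Regex.Replace(
            html,
            @"<script>.*?</script>",
            string.Empty,
            RegexOptions.Singleline);
    }
}
```

Passing the three-line snippet above to StripWithRegex removes everything up to and including the </script> inside the string literal. What remains is the closing quote and semicolon, a newline, and the real </script> tag.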
You can't parse HTML with regular expressions.
Use the HTML Agility Pack instead.
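A rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (HapScriptRemover, RemoveScripts) are made up for illustration, and matching namespaced tags by a ":script" suffix is an assumption about how the parser reports such names. How the parser treats the nested-string example above has not been verified here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // from the HtmlAgilityPack NuGet package

public static class HapScriptRemover
{
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the nodes into a list first, so the tree is not
        // modified while it is still being enumerated.
        var scripts = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name.Equals("script", StringComparison.OrdinalIgnoreCase)
                     || n.Name.EndsWith(":script", StringComparison.OrdinalIgnoreCase))
            .ToList();

        foreach (var node in scripts)
        {
            node.Remove();
        }

        // Serialize the remaining document back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Unlike the regular expression, this works on the parsed tree, so script tags with attributes such as type or src are handled without extra pattern work.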