I've written the following PCRE regex to strip scripts from HTML pages: <script.*?>[\s\S]*?< *?\/ *?script *?>
It works on many online PCRE regex testers:
https://regex101.com/r/lsxyI6/1
https://www.regextester.com/?fam=102647
It does NOT work when I run the following perl substitution command in a bash terminal: cat tmp.html | perl -pe 's/<script.*?>[\s\S]*?< *?\/ *?script *?>//g'
I am using the following test data:
<script>
$(document).ready(function() {
var url = window.location.href;
var element = $('ul.nav a').filter(function() {
if (url.charAt(url.length - 1) == '/') {
url = url.substring(0, url.length - 1);
}
return this.href == url;
}).parent();
if (element.is('li')) {
element.addClass('active');
}
});
</script>
P.S. I am using regex to parse HTML because the HTML parser I am forced to use (xmlpath) breaks when there are complex scripts on the page. I am using this regex to remove scripts from the page before passing it to the parser.
By the way, because I find slurping whole files distasteful, and because I don't care what HTML has to say about line breaks, a quicker, cleaner, more correct way to do this, IF you can guarantee there is no important content on <script> tag lines, is:
perl -ne 'print if !(/<script>/../<\/script>/)' tmp.html
(Modify the two regexes to suit your input, of course.)
The .. operator is a stateful (flip-flop) operator: it is switched on when the expression before it is true and switched off when the expression after it is true.
~/test£ cat example.html
<important1/>
<edgecase1/><script></script><edgecase2/>
<important2/>
<script></script>
<important3/>
<script>
<notimportant/>
</script>
~/test£ perl -ne 'print if !(/<script>/../<\/script>/)' example.html
<important1/>
<important2/>
<important3/>
And to (mostly) keep content that sits on a script tag line but outside the tags:
~/test£ perl -ne 'print if !(/<script>/../<\/script>/);print "$1\n" if /(.+)<script>/;print "$1\n" if /<\/script>(.+)/;' example.html
<important1/>
<edgecase1/>
<edgecase2/>
<important2/>
<important3/>