
Perl regex working in online PCRE tester but not in perl command

I've written the following PCRE regex to strip scripts from HTML pages: <script.*?>[\s\S]*?< *?\/ *?script *?>

It works on many online PCRE regex testers:

https://regex101.com/r/lsxyI6/1

https://www.regextester.com/?fam=102647

It does NOT work when I run the following perl substitution command in a bash terminal: cat tmp.html | perl -pe 's/<script.*?>[\s\S]*?< *?\/ *?script *?>//g'

I am using the following test data:

<script>
                       $(document).ready(function() {
                           var url = window.location.href;
                           var element = $('ul.nav a').filter(function() {
                               if (url.charAt(url.length - 1) == '/') {
                                   url = url.substring(0, url.length - 1);
                               }

                               return this.href == url;
                           }).parent();

                           if (element.is('li')) {
                               element.addClass('active');
                           }
                       });
                   </script>

P.S. I am using regex to parse HTML because the HTML parser I am forced to use (xmlpath) breaks when there are complex scripts on the page. I am using this regex to remove scripts from the page before passing it to the parser.

nulldev asked Dec 10 '22 07:12



1 Answer

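Your command fails because perl -p reads the input one line at a time, so the substitution never sees a <script> block that spans several lines. Slurping the whole file would fix that, but I find slurping whole files distasteful, and I don't care what HTML has to say about line breaks. A quicker, cleaner way to do this, IF you can guarantee there is no important content on the lines that hold the <script> tags, is: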

perl -ne 'print if !(/<script>/../<\/script>/)' tmp.html

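(Modify the two regexes to suit your input, of course.) The .. operator is stateful: it turns on when the expression on its left is true and turns off when the expression on its right is true, and it stays true for every line in between, including the lines that switch it on and off.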

~/test£ cat example.html
<important1/>
<edgecase1/><script></script><edgecase2/>
<important2/>
<script></script>
<important3/>
<script>
<notimportant/>
</script>

~/test£ perl -ne 'print if !(/<script>/../<\/script>/)' example.html
<important1/>
<important2/>
<important3/>

And to (mostly) handle content that shares a line with a script tag but sits outside the tags:

~/test£ perl -ne 'print if !(/<script>/../<\/script>/);print "$1\n" if /(.+)<script>/;print "$1\n" if /<\/script>(.+)/;' example.html
<important1/>
<edgecase1/>
<edgecase2/>
<important2/>
<important3/>
zzxyz answered Jan 26 '23 00:01
