RegEx to match string only if it occurs inside a specific HTML element

I'm trying to find certain code portions in a Visual Studio 2013 project. I'm using the RegEx search function for that (I check "Use Regular Expressions" under Search Options).

More specifically, I'm trying to find the string "findthis" (without quotes) when it lies between an opening and a closing script tag. The RegEx must be able to match across multiple lines.

Example:

<html>
    <head>
        <script>
            var x = 1;

            if (x < 1) {
                x = 100;
            }

            var y = 'findthis'; // Should be matched
        </script>
    </head>
    <body>
        <script>
            var a = 2;
        </script>

        <h1>Welcome!</h1>
        <p>This findthis here should not be matched.</p>

        <script>
            var b = 'findthis too'; // Should be matched, too.
        </script>

        <div>
            <p>This findthis should not be matched neither.</p>
        </div>
    </body>
</html>

What I've tried so far is the following (the (?s) switches on single-line mode, so . also matches line breaks and the match can span several lines):

(?s)\<script\>.*?(findthis).*?\</script\>

The problem here is that it does not stop searching for "findthis" when a script end tag occurs. That's why, in Visual Studio 2013, it also shows the script element right after the body opening tag in the search results.
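The over-matching can be reproduced outside Visual Studio. Below is a minimal sketch in Python, whose re module handles this particular pattern the same way; the shortened HTML sample and the variable names are made up for illustration, and the unnecessary backslashes before < and > are left out.

```python
import re

# Shortened version of the sample page from the question (illustrative).
HTML = """<body>
<script>
    var a = 2;
</script>
<p>This findthis here should not be matched.</p>
<script>
    var b = 'findthis too';
</script>
</body>"""

# The pattern from the question; (?s) lets . match line breaks.
NAIVE = re.compile(r"(?s)<script>.*?(findthis).*?</script>")

match = NAIVE.search(HTML)

# The match begins at the first <script>, even though that block
# contains no "findthis" at all.
starts_in_first_script = match.group(0).startswith("<script>\n    var a = 2;")

# Between the start of the match and the captured "findthis" there is a
# closing tag: the lazy .*? walked straight past it into the <p> text.
crosses_closing_tag = "</script>" in HTML[match.start(0):match.start(1)]

print(starts_in_first_script, crosses_closing_tag)  # True True
```

The lazy quantifier only means "as few characters as possible", not "stop at the first closing tag", so nothing prevents .*? from consuming </script> on its way to the next findthis.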

Can anyone help me out of this RegEx hell?

thomaskonrad asked Feb 10 '23 11:02


2 Answers

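You can use this regex, which refuses to cross any <script> or </script> tag inside the element (keep the (?s) prefix from your attempt if the match has to span line breaks, since . does not match them otherwise):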

<script>((?!</?script>).)*(findthis)((?!</?script>).)*</script>

Or a more efficient one with atomic groups:

<script>(?>(?!</?script>).)*(findthis)(?>(?!</?script>).)*</script>

I am assuming we want to match neither opening nor closing <script> tags in between, so I am using /? inside (?>(?!</?script>).)* to guard against malformed markup as well. I repeat it after (findthis), so that every character consumed on either side is one that does not begin a <script> or </script> tag.
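To see the lookahead guard at work, here is a small sketch in Python. It uses the first (non-atomic) pattern with (?s) prepended, because Python's re module only accepts (?>...) from version 3.11 on; the sample text and the helper line_of are assumptions made for this illustration.

```python
import re

# Condensed version of the sample page from the question (illustrative).
HTML = """<head>
<script>
    var x = 1;
    if (x < 1) { x = 100; }
    var y = 'findthis';
</script>
</head>
<body>
<script>
    var a = 2;
</script>
<p>This findthis here should not be matched.</p>
<script>
    var b = 'findthis too';
</script>
<p>This findthis should not be matched either.</p>
</body>"""

# First pattern from this answer, with (?s) so that . matches line breaks.
# Group 2 is the captured "findthis".
PATTERN = re.compile(
    r"(?s)<script>((?!</?script>).)*(findthis)((?!</?script>).)*</script>"
)


def line_of(text, pos):
    """Return the stripped line of `text` that contains offset `pos`."""
    start = text.rfind("\n", 0, pos) + 1
    end = text.find("\n", pos)
    return text[start:end if end != -1 else len(text)].strip()


hits = [line_of(HTML, m.start(2)) for m in PATTERN.finditer(HTML)]
print(hits)  # ["var y = 'findthis';", "var b = 'findthis too';"]
```

The script block without findthis and both paragraphs are skipped, while the lone < in x < 1 does no harm because it does not start a script tag.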

Tested in Expresso with a slightly modified input (I added < and > everywhere to simulate corrupted markup).

Wiktor Stribiżew answered Feb 13 '23 02:02


Built off of @Aaron's answer:

\<script\>(?:[^<]|<(?!\/script>))*?(findthis).*?\<\/script\>

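The group (?:[^<]|<(?!\/script>)) says "match anything that isn't a <, or a < that isn't followed by /script>", so the search for findthis can never run past a closing script tag. The trailing .*? still relies on . matching line breaks, so keep the (?s) prefix if the closing tag is on a later line.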
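A quick way to check this is the following Python sketch. It is only an illustration: the sample text and the helper line_of are made up here, and (?s) is prepended so that the trailing .*? can reach a closing tag on a later line.

```python
import re

# Condensed version of the sample page from the question (illustrative).
HTML = """<head>
<script>
    if (x < 1) { x = 100; }
    var y = 'findthis';
</script>
</head>
<body>
<script>
    var a = 2;
</script>
<p>This findthis here should not be matched.</p>
<script>
    var b = 'findthis too';
</script>
</body>"""

# Pattern from this answer with (?s) added; group 1 is "findthis".
PATTERN = re.compile(
    r"(?s)\<script\>(?:[^<]|<(?!\/script>))*?(findthis).*?\<\/script\>"
)


def line_of(text, pos):
    """Return the stripped line of `text` that contains offset `pos`."""
    start = text.rfind("\n", 0, pos) + 1
    end = text.find("\n", pos)
    return text[start:end if end != -1 else len(text)].strip()


hits = [line_of(HTML, m.start(1)) for m in PATTERN.finditer(HTML)]
print(hits)  # ["var y = 'findthis';", "var b = 'findthis too';"]
```

In the middle script block the alternation gets stuck at the < of </script> before any findthis has been seen, so that attempt fails instead of leaking into the paragraph.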

asontu answered Feb 13 '23 04:02