Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expressions negative lookahead

I'm doing some regular expression gymnastics. I set myself the task of trying to search for C# code where there is a usage of the as-operator not followed by a null-check within a reasonable amount of space. Now I don't want to parse the C# code. E.g. I want to capture code snippets such as

    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(x1.a == y1.a)

however, not capture

    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(x1 == null)

nor for that matter

    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(somethingunrelated == null) {...}
    if(x1.a == y1.a)

Thus any random null-check will count as a "good check" and hence not found.

The question is: How do I match something while ensuring something else is not found in its sourroundings.

I've tried the naive approach, looking for 'as' then doing a negative lookahead within a 150 characters.

\bas\b.{1,150}(?!\b==\s*null\b)

The above regular expression matches all of the above examples infortunately. My gut tells me, the problem is that the looking ahead and then doing negative lookahead can find many situations where the lookahead does not find the '== null'.

If I try negating the whole expression, then that doesn't help either, at that would match most C# code around.

like image 801
Carlo V. Dango Avatar asked Feb 07 '11 12:02

Carlo V. Dango


5 Answers

I love regex gymnastics! Here is a commented PHP regex:

$re = '/# Find all AS, (but not preceding a XX == null).
    \bas\b               # Match "as"
    (?=                  # But only if...
      (?:                # there exist from 1-150
        [\S\s]           # chars, each of which
        (?!==\s*null)    # are NOT preceding "=NULL"
      ){1,150}?          # (and do this lazily)
      (?:                # We are done when either
        (?=              # we have reached
          ==\s*(?!null)  # a non NULL conditional
        )                #
      | $                # or the end of string.
      )
    )/ix'

And here it is in Javascript style:

re = /\bas\b(?=(?:[\S\s](?!==\s*null)){1,150}?(?:(?===\s*(?!null))|$))/ig;

This one did make my head hurt a little...

Here is the test data I am using:

text = r"""    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(x1.a == y1.a)

however, not capture
    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(x1 == null)

nor for that matter
    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(somethingunrelated == null) {...}
    if(x1.a == y1.a)"""
like image 104
ridgerunner Avatar answered Nov 17 '22 07:11

ridgerunner


Put the .{1,150} inside the lookahead, and replace . with \s\S (in general, . doesn't match newlines). Also, the \b might be misleading near the ==.

\bas\b(?![\s\S]{1,150}==\s*null\b)
like image 22
robert Avatar answered Nov 17 '22 07:11

robert


I think it would help to put the variable name into () so you can use it as a back reference. Something like the following,

\b(\w+)\b\W*=\W*\w*\W*\bas\b[\s\S]{1,150}(?!\b\1\b\W*==\W*\bnull\b)
like image 24
David Winant Avatar answered Nov 17 '22 05:11

David Winant


The question isn't clear. What do you want EXACTLY ? I regret, but I still don't understand, after having read the question and comments numerous times.

.

Must the code be in C# ? In Python ? Other ? There is no indication concerning this point

.

Do you want a matching only if a if(... == ...) line follows a block of var ... = ... lines ?

Or may an heterogenous line be BETWEEN the block and the if(... == ...) line without stopping the matching ?

My code takes the second option as true.

.

Does a if(... == null) line AFTER a if(... == ...) line stop the matchin or not ?

Unable to understand if it is yes or no, I defined the two regexes to catch these two options.

.

I hope my code will be clear enough and answering to your preoccupation.

It is in Python

import re

ch1 ='''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
if(x1.a == y1.a)
1618987987849891
'''

ch2 ='''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
uydtdrdutdutrr
if(x1.a == y1.a)
3213546878'''

ch3='''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
if(x1 == null)
165478964654456454'''

ch4='''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
hgyrtdduihudgug
if(x1 == null)
165489746+54646544'''

ch5='''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
if(somethingunrelated == null ) {...}
if(x1.a == y1.a)
1354687897'''

ch6='''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
ifughobviudyhogiuvyhoiuhoiv
if(somethingunrelated == null ) {...}
if(x1.a == y1.a)
2468748874897498749874897'''

ch7 = '''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
if(x1.a == y1.a)
iufxresguygo
liygcygfuihoiuguyg
if(somethingunrelated == null ) {...}
oufxsyrtuy
'''

ch8 = '''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
tfsezfuytfyfy
if(x1.a == y1.a)
iufxresguygo
liygcygfuihoiuguyg
if(somethingunrelated == null ) {...}
oufxsyrtuy
'''

ch9 = '''kutgdfxfovuyfuuff
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
tfsezfuytfyfy
if(x1.a == y1.a)
if(somethingunrelated == null ) {...}
oufxsyrtuy
'''

pat1 = re.compile(('('
                   '(^var +\S+ *= *\S+ +as .+[\r\n]+)+?'
                   '([\s\S](?!==\s*null\\b))*?'
                   '^if *\( *[^\s=]+ *==(?!\s*null).+$'
                   ')'
                   ),
                  re.MULTILINE)

pat2 = re.compile(('('
                   '(^var +\S+ *= *\S+ +as .+[\r\n]+)+?'
                   '([\s\S](?!==\s*null\\b))*?'
                   '^if *\( *[^\s=]+ *==(?!\s*null).+$'
                   ')'
                   '(?![\s\S]{0,150}==)'
                   ),
                  re.MULTILINE)


for ch in (ch1,ch2,ch3,ch4,ch5,ch6,ch7,ch8,ch9):
    print pat1.search(ch).group() if pat1.search(ch) else pat1.search(ch)
    print
    print pat2.search(ch).group() if pat2.search(ch) else pat2.search(ch)
    print '-----------------------------------------'

Result

>>> 
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
if(x1.a == y1.a)

var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
if(x1.a == y1.a)
-----------------------------------------
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
uydtdrdutdutrr
if(x1.a == y1.a)

var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
uydtdrdutdutrr
if(x1.a == y1.a)
-----------------------------------------
None

None
-----------------------------------------
None

None
-----------------------------------------
None

None
-----------------------------------------
None

None
-----------------------------------------
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
if(x1.a == y1.a)

None
-----------------------------------------
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
tfsezfuytfyfy
if(x1.a == y1.a)

None
-----------------------------------------
var x1 = x as SimpleRes;
var y1 = y as SimpleRes;
tfsezfuytfyfy
if(x1.a == y1.a)

None
-----------------------------------------
>>> 
like image 2
eyquem Avatar answered Nov 17 '22 05:11

eyquem


Let me try to redefine your problem:

  1. Look for an "as" assignment -- you probably needs a better regex to look for actual assignments and may want to store the expression assigned, but let's use "\bas\b" for now
  2. If you see an if (... == null) within 150 characters, don't match
  3. If you don't see an if (... == null) within 150 characters, match

Your expression \bas\b.{1,150}(?!\b==\s*null\b) won't work because of the negative look-ahead. The regex can always skip ahead or behind one letter in order to avoid this negative look-ahead and you end up matching even when there is an if (... == null) there.

Regex's are really not good at not matching something. In this case, you're better of trying to match an "as" assignment with an "if == null" check within 150 characters:

\bas\b.{1,150}\b==\s*null\b

and then negating the check: if (!regex.match(text)) ...

like image 2
Stephen Chung Avatar answered Nov 17 '22 05:11

Stephen Chung