
Correct usage of carets inside a negative lookahead expression in Perl

Tags: regex, perl

I am trying to match any word that is not completely composed of capitals or lowercase letters, and I have the following regex written:

if ($line =~ /(?!^[A-Z][A-Z]+(\s*)$)(?!^[a-z][a-z]+(\s*)$)/) {
    print $line;
}

The expression below should match words with all capital letters

(?!^[A-Z][A-Z]+(\s*)$) 

and this should match words with all lowercase letters

(?!^[a-z][a-z]+(\s*)$)

I combine both and try to match this against the following words: ASDSFSDF, asdfasdfasdf, and asdasdfFFFdsfs. I notice that it matches everything. Only when I move the caret outside the brackets, as in:

/^(?![A-Z][A-Z]+(\s*)$)^(?![a-z][a-z]+(\s*)$)/

do I see that it only matches asdasdfFFFdsfs. Can someone explain to me why I need to move the caret outside of the negative lookahead expression? I am new to regexp and I am confused.

Thanks.

mlikj2006 asked Sep 15 '13 22:09


2 Answers

You fell into a trap of multiple negations and anchoring, and your resulting regex doesn't quite do what you want. Let's assume we only have the simplified regex /(?!^[A-Z]$)/ and the string "1".

At the first position (before the 1), the assertion is tested. The ^ matches here, but [A-Z] does not. Therefore, ^[A-Z]$ fails. As the lookahead is negative, the assertion succeeds, and with it the whole pattern.

Now let's assume we have the string "A". At the first position, the assertion is tested. The pattern ^[A-Z]$ matches here. Because it is a negative lookahead, the assertion fails.

Because the pattern as a whole is not anchored, the engine then retries at the second position (after the A). The assertion is tested again, but ^ doesn't match here, so the negative assertion makes the pattern succeed!
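The walk-through above can be checked with a small script. This is only a sketch: match_offset is a hypothetical helper introduced for the illustration, and it relies on Perl's @- array, whose first element holds the start offset of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset at which $regex matches in $string,
# or undef if it does not match at all.
sub match_offset {
    my ($string, $regex) = @_;
    return $string =~ $regex ? $-[0] : undef;
}

# The simplified pattern from the explanation, with ^ inside the lookahead.
my $unanchored = qr/(?!^[A-Z]$)/;

# "1": ^[A-Z]$ cannot match at offset 0, so the negative assertion succeeds there.
my $offset_digit = match_offset("1", $unanchored);

# "A": the assertion fails at offset 0, the engine moves on to offset 1,
# where ^ cannot match, and the negative assertion succeeds.
my $offset_letter = match_offset("A", $unanchored);

printf "%s matched at offset %d\n", '"1"', $offset_digit;
printf "%s matched at offset %d\n", '"A"', $offset_letter;
```

The second result is the surprising one: the match is an empty string found after the A, not a match of the word itself.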

Therefore, your regex doesn't match the patterns you wanted. You can suppress this behaviour by anchoring outside the assertion:

/^(?![A-Z]$)/

in this case. Note that in your case, the easiest solution is to write a regex that matches all inputs you don't want, and then negate that result:

print $line unless $line =~ /^(?:[A-Z]{2,}|[a-z]{2,})\s*$/;
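As a rough check, both approaches can be run against the three sample words from the question. The sketch below assumes the words are held in an array rather than read line by line, and it writes the anchored lookaheads with {2,} in place of the question's [A-Z][A-Z]+, which is equivalent.

```perl
use strict;
use warnings;

# Sample words taken from the question.
my @words = ('ASDSFSDF', 'asdfasdfasdf', 'asdasdfFFFdsfs');

# Anchor once, outside the assertions, so both lookaheads are only ever
# tested at the start of the string.
my @kept_by_lookahead = grep { /^(?![A-Z]{2,}\s*$)(?![a-z]{2,}\s*$)/ } @words;

# Match the unwanted inputs and negate the result instead.
my @kept_by_negation = grep { !/^(?:[A-Z]{2,}|[a-z]{2,})\s*$/ } @words;

print "lookahead: @kept_by_lookahead\n";
print "negation:  @kept_by_negation\n";
```

Both lists should contain only asdasdfFFFdsfs, since the all-uppercase and all-lowercase words are rejected in each case.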

(Edit: actually TLP's 2nd solution is even simpler, and likely more efficient)

amon answered Nov 15 '22 04:11


How about just checking the string for the upper and lower case characters?

(?=.*[A-Z])(?=.*[a-z])

As you see, this will not match strings consisting of only one case, because both lookaheads must match.

Of course, this is just a complicated way of performing two regex matches and combining the result:

if ($line =~ /[A-Z]/ and $line =~ /[a-z]/) {
    print $line;
}
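To see that the two forms agree, here is a minimal sketch that applies both to the sample words from the question. The array and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Sample words taken from the question.
my @words = ('ASDSFSDF', 'asdfasdfasdf', 'asdasdfFFFdsfs');

# Single regex: both lookaheads must succeed at the same position.
my @via_lookaheads = grep { /(?=.*[A-Z])(?=.*[a-z])/ } @words;

# Two separate matches combined with "and".
my @via_two_matches = grep { /[A-Z]/ and /[a-z]/ } @words;

print "lookaheads:  @via_lookaheads\n";
print "two matches: @via_two_matches\n";
```

Note that this check asks a slightly different question than the negated regex in the other answer: it requires at least one uppercase and one lowercase letter, so a string with no letters at all, such as "123", is rejected here but would be printed there.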
TLP answered Nov 15 '22 05:11