
regex - confused about lookaround functionality

If I write

(?<=\()\w+(?=\))

for this string: (Test) (Test2) (Test3)

I will get: Test Test2 Test3

That makes sense.

If I write

\w+ (?<=\()\w+(?=\))

For this string: LTE (Test)

it returns nothing. What's the problem here?

Please explain your regex clearly since it can be hard to read.

hamobi asked Aug 14 '13 16:08


2 Answers

Lookarounds do not consume characters!

Here's a step-by-step way to see it (it might not be the best, but that's how I interpret it anyway):

The first character is L; the regex engine compares it with \w+ and agrees that it's a match. The same happens for T, then E.

At the space, the regex engine sees a literal space in the regular expression, so that matches as well.

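Next up is the opening paren, but what does the regex see? Remember that lookarounds do not consume characters: the \( in (?<=\() only checks that the character just before the current position is an opening paren, and here that character is the space. On top of that, nothing in the pattern consumes the (, and \w+ cannot match it!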

You can think of the regex as consuming only these characters: \w+ \w+, with a condition on the second \w+ that it must be found between parens. The condition may hold for Test in the string, but the expression itself never consumes any parentheses, so a space and a ( cannot both sit right before the second \w+!
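The failure can be reproduced with a short sketch. The question does not say which regex flavor is in use, so Python's re module is an assumption here; the variable names are made up for illustration.

```python
import re

text = "LTE (Test)"

# Original pattern: the character right before the second \w+ would have to be
# a space (consumed by the literal space) and "(" (required by the lookbehind)
# at the same time, which is impossible.
failing = re.search(r"\w+ (?<=\()\w+(?=\))", text)

# The lookarounds on their own work, because nothing else in the pattern
# makes a claim about the character before \w+.
inner = re.findall(r"(?<=\()\w+(?=\))", "(Test) (Test2) (Test3)")

print(failing)  # None
print(inner)    # ['Test', 'Test2', 'Test3']
```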

To make it match, you should add the parens:

\w+ \((?<=\()\w+(?=\))\)

After matching the space, the regex engine sees (, which agrees with the \( in the expression, so it moves forward.

The engine is now at T. First, is there an opening paren just before it, as the lookbehind requires? Yes. Second, does T match the next token, \w+? Yes.

Because \w+ is greedy, it keeps going and matches e, s and t, stopping only when the next character is not a word character. Then the engine reaches the positive lookahead. Is there a closing paren just ahead of the current position? Yes, so it proceeds to the next check.

It then encounters the closing paren, which is matched by the \) in the expression. (Note that the literal closing paren could be dropped here, in which case you would be matching LTE (Test instead.)
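As a quick check of the walkthrough above, here is a minimal sketch, again assuming Python's re module (the variable names are only illustrative). It runs the corrected pattern with and without the trailing literal paren.

```python
import re

text = "LTE (Test)"

# Corrected pattern: the parens are consumed by \( and \),
# and the lookarounds merely re-check what was already matched.
with_paren = re.search(r"\w+ \((?<=\()\w+(?=\))\)", text)

# Same pattern without the trailing \): the lookahead still requires ")"
# to follow, but it is not part of the match.
without_close = re.search(r"\w+ \((?<=\()\w+(?=\))", text)

print(with_paren.group())     # LTE (Test)
print(without_close.group())  # LTE (Test
```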

But with the parens now matched literally, the lookarounds are redundant, and it would be just as good to drop them:

\w+ \(\w+\)

The lookarounds add extra work for the engine, and even though that's not visible at a small scale, it can be significant on a larger string.
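To see that dropping the lookarounds does not change what is matched, here is a small sketch, assuming Python's re module. The sample strings are made up for illustration, and the snippet compares results only; it does not measure speed.

```python
import re

with_lookarounds = re.compile(r"\w+ \((?<=\()\w+(?=\))\)")
plain = re.compile(r"\w+ \(\w+\)")

# Hypothetical sample inputs, chosen only to exercise both patterns.
samples = ["LTE (Test)", "LTE Test", "a (b) c (d)", "(Test)"]

# Both patterns should return identical match lists for every sample.
same = all(with_lookarounds.findall(s) == plain.findall(s) for s in samples)

print(same)                          # True
print(plain.findall("a (b) c (d)"))  # ['a (b)', 'c (d)']
```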

Hopefully this helps, even if only a little bit!

Jerry answered Oct 18 '22 14:10


Lookahead and lookbehind are "zero-width assertions": they do not consume characters in the string, but only assert whether a match is possible or not. Your second pattern tries to find a <word1><space><word2> structure, but it also expects <word2> to be surrounded by parentheses. It cannot match anything, since the only character the pattern allows immediately before <word2> is a <space>, while the lookbehind demands a ( in that same spot. I would simply write the parentheses directly into the pattern: (\w+) \((\w+)\). I tried it, and it gives me LTE and Test.
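The answer does not say which tool was used to try the pattern, so as an assumption, here is how it could be checked with Python's re module:

```python
import re

# Capture groups pull out the two words; the parens themselves are
# consumed by \( and \) but are not part of either group.
match = re.search(r"(\w+) \((\w+)\)", "LTE (Test)")

print(match.groups())  # ('LTE', 'Test')
```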

kol answered Oct 18 '22 15:10