
Parsing multiple names - Lookbehind in the middle of regex doesn't work

I am having trouble getting this regex to work and none of the canned ones I have found work reliably.

The desired result:

Produce the following via regex matches:

"Person One"
"Person Two"
"Person Three"

Out of these example lines:

By Person One, Person Two and Person Three
By Person One, Person Two
By Person One
By Person Two and Person Three

Here is what I have. Note that if you break the pattern into sections, each section gets partial matches, but something about the lookbehind is throwing the whole thing off. Also, is there a simpler but still reliable way to pull all the "Persons", whether one, two, or three are given, with an "and" before the last one? It does not have to support more than three, but I would think that as long as the "and" comes last, the number of "Persons" can remain variable without impacting the regex.

Saved current attempt (it matches one line, but if you split off my "and" lookbehind and run it on its own, it does match all the "and" lines):

(?<=by )((\w+) (\w+))(?:,\s*)?((\w+) (\w+))?(?:\s*(?<=and ))((\w+) (\w+))

https://regex101.com/r/z3Y9TQ/1

Collin Chaffin asked May 11 '18 00:05


2 Answers

Instead of using a Lookbehind to check for "and", you can use a non-capturing group, like you did with the comma:

(?<=by )(\w+ \w+)(?:,\s*)?(\w+ \w+)?(?:\sand\s)?(\w+ \w+)?

Note that you don't need to wrap each \w+ in its own group.

Try it online.
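
For anyone who wants to check this outside regex101, here is a minimal Python sketch of the pattern above. It assumes the case-insensitive flag is on (the sample lines start with "By" while the pattern says "by") and that each line is tested on its own. The helper name extract_names is made up for illustration.

```python
import re

# Pattern from the answer above. IGNORECASE is an assumption so that
# "(?<=by )" also accepts the capitalized "By " in the sample lines.
PATTERN = re.compile(
    r"(?<=by )(\w+ \w+)(?:,\s*)?(\w+ \w+)?(?:\sand\s)?(\w+ \w+)?",
    re.IGNORECASE,
)


def extract_names(line):
    """Return the names captured from one "By ..." line (hypothetical helper)."""
    match = PATTERN.search(line)
    if match is None:
        return []
    # Optional groups that did not take part in the match are None; drop them.
    return [name for name in match.groups() if name is not None]


sample_lines = [
    "By Person One, Person Two and Person Three",
    "By Person One, Person Two",
    "By Person One",
    "By Person Two and Person Three",
]

for line in sample_lines:
    print(extract_names(line))
```

Filtering out the None groups is what lets the same three-slot pattern report one, two, or three names.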


Lookbehind in the middle of regex:

The reason why the Lookbehind won't work in this case is that you have it in the middle of your regex pattern, without consuming the text it is supposed to check. A Lookbehind is a zero-width assertion: it consumes no characters and only checks that the text immediately before the current position matches what's inside it. At the start of a pattern (e.g., (?<=prior)subsequent), that simply means subsequent matches only where prior comes right before it. In the middle of a pattern, two conditions apply at the same position: the text ending there must have been matched by the part of the pattern before the Lookbehind, and it must also end with what's inside the Lookbehind (i.e., prior). See where the problem comes from?

Therefore, in your example, the only way to match the full sentence with the Lookbehind in the middle is to also include the and in the pattern, which makes the Lookbehind redundant.

To illustrate, take a look at this demo. As you can see, the pattern (?<=and )Person matches Person when it comes after and. Now let's change it to Two (?<=and )Person. You'd probably think it'll work, but it actually finds no matches. After matching "Two ", the engine sits right before "and", where the Lookbehind fails because the text just before that position is "Two ", not "and ". And "Person" doesn't immediately follow "Two " anyway, so there is no position where the whole pattern can match.

The only way to make the Lookbehind work in this case is to also include the and right after the Two, like this: Two and (?<=and )Person, which makes the Lookbehind redundant, as explained above.
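
The three patterns discussed above can be tried directly. This is only a small sketch using Python's re module on a shortened sample string; the variable names are mine.

```python
import re

text = "Person Two and Person Three"

# Lookbehind at the start: "Person" matches only where "and " precedes it.
alone = re.search(r"(?<=and )Person", text)

# Lookbehind in the middle: right after "Two " the preceding text is
# "Two ", not "and ", so the assertion can never hold here.
middle = re.search(r"Two (?<=and )Person", text)

# Consuming "and " first lets the assertion pass, but it now adds nothing:
# the pattern without the Lookbehind matches exactly the same text.
redundant = re.search(r"Two and (?<=and )Person", text)
plain = re.search(r"Two and Person", text)

print(alone.start())                       # 15, the second "Person"
print(middle)                              # None
print(redundant.group() == plain.group())  # True
```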

A similar behavior, but for Lookaheads (i.e., when a Lookahead comes in the middle), is very well explained in this awesome answer by revo.

Hope that helps.

41686d6564 stands w. Palestine answered Oct 22 '22 08:10


I can't seem to get the lookbehind for and working, but this works with a non-capturing group:

(?<=by )(\w+ \w+)(?:, *)?(\w+ \w+)?(?: *)(?:and (\w+ \w+))?

I changed \s to a literal space in the regexp so it won't match newlines.

DEMO
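
A rough Python sketch of why the space matters when all the lines are searched as one block of text. The case-insensitive flag is an assumption (the samples start with "By"), the \s variant is my reconstruction of the pattern before the change, and the trailing-comma input is a made-up edge case, not one of the lines from the question.

```python
import re

FLAGS = re.IGNORECASE  # assumption: the sample lines start with "By"

# Pattern from this answer, using literal spaces.
with_space = re.compile(
    r"(?<=by )(\w+ \w+)(?:, *)?(\w+ \w+)?(?: *)(?:and (\w+ \w+))?", FLAGS
)
# Same pattern with \s put back (reconstruction, for comparison only).
with_whitespace = re.compile(
    r"(?<=by )(\w+ \w+)(?:,\s*)?(\w+ \w+)?(?:\s*)(?:and (\w+ \w+))?", FLAGS
)

sample = "\n".join([
    "By Person One, Person Two and Person Three",
    "By Person One, Person Two",
    "By Person One",
    "By Person Two and Person Three",
])

# One match per line; groups that did not participate are dropped.
names_per_line = [
    [name for name in m.groups() if name is not None]
    for m in with_space.finditer(sample)
]
print(names_per_line)

# Hypothetical edge case: a comma at the end of a line.
edge = "By Person One,\nBy Person Two"
print(with_space.search(edge).groups())       # stays on the first line
print(with_whitespace.search(edge).groups())  # \s* crosses the newline
```

On the four sample lines both variants behave the same; the difference only shows up when a comma is directly followed by a line break, where \s* lets the second group swallow "By Person" from the next line.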

Barmar answered Oct 22 '22 09:10