Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Regular Expressions: Capture lookahead value (capturing text without consuming it)

I wish to use regular expressions to split words into groups of (vowels, not_vowels, more_vowels), using a marker to ensure every word begins and ends with a vowel.

import re

MARKER = "~"
VOWELS = {"a", "e", "i", "o", "u", MARKER}

word = "dog"

if word[0] not in VOWELS:
    word = MARKER+word

if word[-1] not in VOWELS:
    word += MARKER

re.findall("([%]+)([^%]+)([%]+)".replace("%", "".join(VOWELS)), word)

In this example we get:

[('~', 'd', 'o')]

The issue is that I wish the matches to overlap - the last set of vowels should become the first set of the next match. This appears possible with lookaheads, if we replace the regex as follows:

re.findall("([%]+)([^%]+)(?=[%]+)".replace("%", "".join(VOWELS)), word)

We get:

[('~', 'd'), ('o', 'g')]

Which means we are matching what I want. However, it now doesn't return the last set of vowels. The output I want is:

[('~', 'd', 'o'), ('o', 'g', '~')]

I feel this should be possible (if the regex can check for the second set of vowels, I see no reason it can't return them), but I can't find any way of doing it beyond the brute force method, looping through the results after I have them and appending the first character of the next match to the last match, and the last character of the string to the last match. Is there a better way in which I can do this?

The two things that would work would be capturing the lookahead value, or not consuming the text on a match, while capturing the value - I can't find any way of doing either.

like image 366
Gareth Latty Avatar asked Dec 12 '22 03:12

Gareth Latty


2 Answers

I found it just after posting:

re.findall("([%]+)([^%]+)(?=([%]+))".replace("%", "".join(VOWELS)), word)

Adding an extra pair of brackets inside the lookahead means that it becomes a capture itself.

I found this pretty obscure and hard to find - I'm not sure if it's just everyone else found this obvious, but hopefully anyone else in my position will find this more easily in future.

like image 125
Gareth Latty Avatar answered Dec 14 '22 18:12

Gareth Latty


I would not try to make the regex engine do this; I would split the string into consonant and vowel chunks, and then generate the overlapping results. This way, you also don't actually need to hack in markers, assuming you're okay with '' as the "vowel" part when the word doesn't actually being or end with a vowel.

def overlapping_matches(word):
    pieces = re.split('([^aeiou]+)', word)
    # There are other ways to do this; I'm kinda showing off
    return zip(pieces[:-2], pieces[1:-1], pieces[2:])[::2]

overlapping_matches('dog') # [('', 'd', 'o'), ('o', 'g', '')]

(This still fails if word contains only vowels, but that is trivially corrected if necessary.)

like image 36
Karl Knechtel Avatar answered Dec 14 '22 17:12

Karl Knechtel