I wish to use regular expressions to split words into groups of (vowels, not_vowels, more_vowels)
, using a marker to ensure every word begins and ends with a vowel.
import re
MARKER = "~"
VOWELS = {"a", "e", "i", "o", "u", MARKER}
word = "dog"
if word[0] not in VOWELS:
word = MARKER+word
if word[-1] not in VOWELS:
word += MARKER
re.findall("([%]+)([^%]+)([%]+)".replace("%", "".join(VOWELS)), word)
In this example we get:
[('~', 'd', 'o')]
The issue is that I wish the matches to overlap - the last set of vowels should become the first set of the next match. This appears possible with lookaheads, if we replace the regex as follows:
re.findall("([%]+)([^%]+)(?=[%]+)".replace("%", "".join(VOWELS)), word)
We get:
[('~', 'd'), ('o', 'g')]
Which means we are matching what I want. However, it now doesn't return the last set of vowels. The output I want is:
[('~', 'd', 'o'), ('o', 'g', '~')]
I feel this should be possible (if the regex can check for the second set of vowels, I see no reason it can't return them), but I can't find any way of doing it beyond the brute force method, looping through the results after I have them and appending the first character of the next match to the last match, and the last character of the string to the last match. Is there a better way in which I can do this?
The two things that would work would be capturing the lookahead value, or not consuming the text on a match, while capturing the value - I can't find any way of doing either.
I found it just after posting:
re.findall("([%]+)([^%]+)(?=([%]+))".replace("%", "".join(VOWELS)), word)
Adding an extra pair of brackets inside the lookahead means that it becomes a capture itself.
I found this pretty obscure and hard to find - I'm not sure if it's just everyone else found this obvious, but hopefully anyone else in my position will find this more easily in future.
I would not try to make the regex engine do this; I would split the string into consonant and vowel chunks, and then generate the overlapping results. This way, you also don't actually need to hack in markers, assuming you're okay with ''
as the "vowel" part when the word doesn't actually being or end with a vowel.
def overlapping_matches(word):
pieces = re.split('([^aeiou]+)', word)
# There are other ways to do this; I'm kinda showing off
return zip(pieces[:-2], pieces[1:-1], pieces[2:])[::2]
overlapping_matches('dog') # [('', 'd', 'o'), ('o', 'g', '')]
(This still fails if word
contains only vowels, but that is trivially corrected if necessary.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With