This is sort of a contrived example, but I'm trying to get at a general principle here.
Given phrases written in English using this list-like form:
I have a cat
I have a cat and a dog
I have a cat, a dog, and a guinea pig
I have a cat, a dog, a guinea pig, and a snake
Can I use a regular expression to get all of the items, regardless of how many there are? Note that the items may contain multiple words.
Obviously, if I have just one item, I can use I have a (.+), and if there are exactly two, I have a (.+) and a (.+) works.
But things get more complicated if I want to match more than just one example. If I want to extract the list items from the first two examples, I would think this would work: I have a (.*)(?: and a (.*))?
And while this works on the first phrase, telling me "I have a cat" and null, for the second one it tells me "I have a cat and a dog" and null. Things only get worse when I try to match phrases in even more forms.
Is there any way I can use regexes for this purpose? It seems rather simple, and I don't understand why my regex that matches 2-item lists works, but the one that matches 1- or 2-item lists does not.
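A short sketch of why the 1-or-2-item pattern behaves this way, assuming Python's re module (the question does not name a regex engine). The first (.*) is greedy, so it consumes the rest of the phrase, and the optional group is then free to match nothing. Making the first group lazy and anchoring the end with $ is one possible fix. The names GREEDY, LAZY_ANCHORED and groups are made up for this illustration.

```python
import re

# The pattern from the question: the first group is greedy.
GREEDY = r'I have a (.*)(?: and a (.*))?'
# Hypothetical fix: lazy first group, with $ forcing the match to reach the end.
LAZY_ANCHORED = r'I have a (.*?)(?: and a (.*))?$'


def groups(pattern, phrase):
    """Return the captured groups, or None if the phrase does not match."""
    m = re.match(pattern, phrase)
    return m.groups() if m else None


phrase = 'I have a cat and a dog'
print(groups(GREEDY, phrase))                 # ('cat and a dog', None)
print(groups(LAZY_ANCHORED, phrase))          # ('cat', 'dog')
print(groups(LAZY_ANCHORED, 'I have a cat'))  # ('cat', None)
```

This still only handles one or two items, because a regex has a fixed number of capturing groups. Lists of arbitrary length need repeated matching or splitting.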
You can use a non-capturing group as a conditional delimiter (either a comma or end-of-line): ' a (.*?)(?:,|$)'
Example in Python:
import re
line = 'I have a cat, a dog, a guinea pig, and a snake'
mat = re.findall(r' a (.*?)(?:,|$)', line)
print(mat)  # ['cat', 'dog', 'guinea pig', 'snake']
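One caveat: the pattern above relies on commas, so for the two-item phrase 'I have a cat and a dog' it returns 'cat and a dog' as a single item. A possible variant, sketched here as an assumption rather than as part of the original answer, ends each item with a lookahead that also accepts ' and a '. The names ITEM and list_items are made up for this sketch, and it assumes every item is introduced by the article "a".

```python
import re

# Each item starts after ' a ' and ends (lazily) just before a comma,
# the text ' and a ', or the end of the string. The lookahead does not
# consume the delimiter, so the next search can still find ' a '.
ITEM = re.compile(r' a (.+?)(?=,| and a |$)')


def list_items(phrase):
    """Return every list item found in the phrase."""
    return ITEM.findall(phrase)


for sample in ('I have a cat',
               'I have a cat and a dog',
               'I have a cat, a dog, and a guinea pig',
               'I have a cat, a dog, a guinea pig, and a snake'):
    print(list_items(sample))
```

Items introduced by "an" or by no article at all would need a broader pattern.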
I would use regex splitting for this. Note that it assumes the sentence format exactly matches your input set:
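>>> import re
>>> # ', and ' is listed before ', ' so the longer delimiter is tried first; \b keeps 'and' from matching inside a word
>>> SPLIT_REGEX = r', and |, |I have|\band\b'
>>> for sample in ('I have a cat', 'I have a cat and a dog', 'I have a cat, a dog, and a guinea pig', 'I have a cat, a dog, a guinea pig, and a snake'):
...     print([x.strip() for x in re.split(SPLIT_REGEX, sample) if x.strip()])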
...
['a cat']
['a cat', 'a dog']
['a cat', 'a dog', 'a guinea pig']
['a cat', 'a dog', 'a guinea pig', 'a snake']