Regular expression for items listed in plain English

Tags: java, regex

This is sort of a contrived example, but I'm trying to get at a general principle here.

Given phrases written in English using this list-like form:

I have a cat
I have a cat and a dog
I have a cat, a dog, and a guinea pig
I have a cat, a dog, a guinea pig, and a snake

Can I use a regular expression to get all of the items, regardless of how many there are? Note that the items may contain multiple words.

Obviously, if I have just one item, I can use 'I have a (.+)', and if there are exactly two, 'I have a (.+) and a (.+)' works.

But things get more complicated if I want to match more than one form. To extract the list items from the first two examples, I would think this would work:

I have a (.*)(?: and a (.*))?

This works on the first phrase, where the captured groups are "cat" and null. For the second phrase, however, the groups are "cat and a dog" and null. Things only get worse when I try to match phrases in even more forms.

Is there any way I can use regexes for this purpose? It seems rather simple, and I don't understand why my regex that matches 2-item lists works, but the one that matches 1- or 2-item lists does not.
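
The behaviour described above can be reproduced with a short sketch. It is written in Python rather than Java, purely as an assumption so that it matches the language used in the answers below; the names GREEDY, LAZY, and groups are hypothetical. It suggests the cause is greedy matching: the first (.*) consumes the rest of the phrase, so the optional group is left with nothing to match and is skipped. A lazy first group anchored with $ is one possible way around that for the 1- and 2-item forms.

```python
import re

# Pattern from the question: the first group is greedy.
GREEDY = re.compile(r'I have a (.*)(?: and a (.*))?')

# Hypothetical variant: lazy first group, anchored at the end of the input,
# so the optional " and a ..." part gets a chance to match.
LAZY = re.compile(r'I have a (.*?)(?: and a (.*))?$')


def groups(pattern, phrase):
    """Return the captured groups, or None if the phrase does not match."""
    m = pattern.match(phrase)
    return m.groups() if m else None


for phrase in ('I have a cat', 'I have a cat and a dog'):
    print(groups(GREEDY, phrase), groups(LAZY, phrase))
# ('cat', None) ('cat', None)
# ('cat and a dog', None) ('cat', 'dog')
```

This only covers lists of one or two items; longer lists need a repeated match or a split.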

codebreaker asked Aug 01 '14 18:08


2 Answers

You can use a non-capturing group as a conditional delimiter (either a comma or end-of-line):
' a (.*?)(?:,|$)'

Example in Python:

import re
line = 'I have a cat, a dog, a guinea pig, and a snake'
# findall returns the captured group for every non-overlapping match
mat = re.findall(r' a (.*?)(?:,|$)', line)
print(mat)  # ['cat', 'dog', 'guinea pig', 'snake']
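
Note that the pattern above relies on commas, so for the two-item form 'I have a cat and a dog' it returns a single item, 'cat and a dog'. A possible variant, sketched here under the assumption that every item is introduced by " a " and that no item contains " and " itself, uses a lookahead so the delimiter is not consumed. The names ITEM and items are hypothetical.

```python
import re

# Lazy item text, ended by a comma, the word "and", or end of input.
# The lookahead leaves the delimiter in place, so the " a " that starts
# the next item is still available for the next match.
ITEM = re.compile(r' a (.*?)(?=,| and |$)')


def items(line):
    """Return every item introduced by ' a ' in the given phrase."""
    return ITEM.findall(line)


print(items('I have a cat and a dog'))
# ['cat', 'dog']
print(items('I have a cat, a dog, a guinea pig, and a snake'))
# ['cat', 'dog', 'guinea pig', 'snake']
```

Items that start with "an" or that contain " and " would need a different delimiter rule.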
Nir Alfasi answered Sep 28 '22 19:09


I would use regex splitting for this, though it assumes the sentence format exactly matches your input set:

>>> import re
>>> SPLIT_REGEX = r', |I have|and|, and'
>>> for sample in ('I have a cat', 'I have a cat and a dog', 'I have a cat, a dog, and a guinea pig', 'I have a cat, a dog, a guinea pig, and a snake'):
...     print([x.strip() for x in re.split(SPLIT_REGEX, sample) if x.strip()])
...
['a cat']
['a cat', 'a dog']
['a cat', 'a dog', 'a guinea pig']
['a cat', 'a dog', 'a guinea pig', 'a snake']
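
One caveat: the alternation above splits on any occurrence of "and", including inside a word such as "sandwich". A slightly stricter sketch, still assuming the phrase starts with "I have" and that items are separated by commas and/or the standalone word "and", could look like this. The names PREFIX, SEPARATOR, ARTICLE, and list_items are hypothetical.

```python
import re

PREFIX = re.compile(r'^I have\s+')                      # fixed lead-in
SEPARATOR = re.compile(r'\s*,\s*(?:and\s+)?|\s+and\s+')  # ", ", ", and ", " and "
ARTICLE = re.compile(r'^an?\s+')                         # leading "a " / "an "


def list_items(phrase):
    """Split the phrase into items and drop the leading article of each."""
    body = PREFIX.sub('', phrase)
    return [ARTICLE.sub('', part) for part in SEPARATOR.split(body) if part]


for sample in ('I have a cat',
               'I have a cat and a dog',
               'I have a cat, a dog, and a guinea pig',
               'I have a cat and a sandwich'):
    print(list_items(sample))
# ['cat']
# ['cat', 'dog']
# ['cat', 'dog', 'guinea pig']
# ['cat', 'sandwich']
```

Because "and" must be surrounded by whitespace to count as a separator, words that merely contain it are left intact.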
Santa answered Sep 28 '22 18:09