
Python regex: Alternation for sets of words

Tags: python, regex

We know that \ba\b|\bthe\b will match either the word "a" or the word "the".
I want to build a regex to match a pattern like

a/the/one reason/reasons for/of

That is, I want to match a string s that contains 3 words:

  • the first word of s should be "a", "the" or "one"
  • the second word should be "reason" or "reasons"
  • the third word of s should be "for" or "of"

The regex \ba\b|\bthe\b|\bone\b \breason\b|reasons\b \bfor\b|\bof\b doesn't help.

How can I do this? BTW, I use python. Thanks.

user1903382 asked Dec 02 '22 14:12


2 Answers

You need to wrap each set of alternatives in a group so that the ORs (|) do not mix with one another:

(\ba\b|\bthe\b|\bone\b) (\breason\b|reasons\b) (\bfor\b|\bof\b)

A more elegant way is to put the word boundaries around the groups instead of around every word. Note that where a literal space separates the words in your regex, there is no need for a word boundary. For reasons and reason, you can make the last s optional with ?. And if you don't want to match your words as separate groups, you can turn the groups into non-capturing groups with ?:.

\b(?:a|the|one) reasons? (?:for|of)\b

Or use capturing groups if you want each word in its own group:

\b(a|the|one) (reasons?) (for|of)\b
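
In Python, the two patterns above could be used roughly as follows. This is only a minimal sketch: the sample sentence and the variable names are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample text, not taken from the question.
text = "This is one reason for the delay and the reasons of the failure."

# Non-capturing groups: findall returns each whole match as a string.
non_capturing = re.compile(r"\b(?:a|the|one) reasons? (?:for|of)\b")
whole_matches = non_capturing.findall(text)
print(whole_matches)

# Capturing groups: findall returns one tuple per match, one entry per group.
capturing = re.compile(r"\b(a|the|one) (reasons?) (for|of)\b")
grouped_matches = capturing.findall(text)
print(grouped_matches)
```

With non-capturing groups you get the matched phrases themselves; with capturing groups you get the three words separately, which is handy if you need to know which article or preposition was used.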

Mazdak answered Dec 06 '22 10:12


The regular expression operator A|B means "if either A or B matches, then the whole thing matches". So in your case, the resulting regular expression matches wherever any of the following 5 regular expressions matches:

  • \ba\b
  • \bthe\b
  • \bone\b \breason\b
  • reasons\b \bfor\b
  • \bof\b
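
The split into five alternatives can be checked directly in Python. The snippet below is a small sketch using the pattern from the question unchanged; the test strings and variable names are chosen here purely for illustration.

```python
import re

# The pattern from the question, unchanged.
original = re.compile(r"\ba\b|\bthe\b|\bone\b \breason\b|reasons\b \bfor\b|\bof\b")

# Each string below satisfies only one of the five alternatives,
# yet the pattern still reports a match for every one of them.
samples = ["a", "the", "one reason", "reasons for", "of"]
matched = [original.search(s).group(0) for s in samples]
print(matched)

# On a full phrase, the first alternative that succeeds wins,
# so only the leading "a" is matched, not the whole phrase.
first = original.search("a reason for").group(0)
print(first)
```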

To limit the extent to which | applies, use a non-capturing group, that is, (?:something|something else). Also, you do not need alternation for the optional s at the end of reason; writing reasons? does the same job.

Thus we get the regular expression \b(?:a|the|one) reasons? (?:for|of)\b.

Note that you do not need the word boundary operator \b inside the regular expression, only at the beginning and the end. Without those outer boundaries, it would match something like everyone reasons forever.
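
The effect of the outer word boundaries can be seen with a quick comparison. This is a sketch only; the variable names are illustrative, and the second pattern is simply the same regex with the two \b removed.

```python
import re

with_boundaries = re.compile(r"\b(?:a|the|one) reasons? (?:for|of)\b")
without_boundaries = re.compile(r"(?:a|the|one) reasons? (?:for|of)")

text = "everyone reasons forever"

# With \b at both ends, "one" inside "everyone" cannot start a match.
bounded = with_boundaries.search(text)
print(bounded)

# Without the boundaries, the pattern matches across the word fragments.
unbounded = without_boundaries.search(text)
print(unbounded.group(0))
```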