Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the reason behind the advice that the substrings in regex should be ordered based on length?

Tags:

python

regex

longest first

>>> p = re.compile('supermanutd|supermanu|superman|superm|super')

shortest first

>>> p = re.compile('super|superm|superman|supermanu|supermanutd')

Why is the longest first regex preferred?

like image 443
Frankie Ribery Avatar asked Apr 26 '11 08:04

Frankie Ribery


2 Answers

Alternatives in Regexes are tested in order you provide, so if first branch matches, then Rx doesn't check other branches. This doesn't matter if you only need to test for match, but if you want to extract text based on match, then it matters.

You only need to sort by length when your shorter strings are substrings of longer ones. For example when you have text:

supermanutd
supermanu
superman
superm

then with your first Rx you'll get:

>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']

but with second Rx:

>>> regex.findall(string)
[u'super', u'super', u'super', u'super', u'super']

Test your regexes with http://www.pythonregex.com/

like image 190
MBO Avatar answered Oct 06 '22 01:10

MBO


As @MBO says, alternatives are tested in the order they are written, and once one of them matches, the RE engine goes on to what comes after.
This behaviour is common to Perl-like RE engines, and ultimately goes back to the 1985 Bell Labs design of the RE library for Edition 8 Unix.
Note that POSIX 2 (from 1991) has another definition, insisting on the leftmost longest match for the whole RE and subject to that, for each subexpression in turn (in lexical order). In POSIX 2, order of alternatives does not matter.

However, the difference in behaviour is often: irrelevant (if you're just testing), masked by backtracking (if the shorter match causes the rest of the RE to fail), or compensated by the rest of the RE matching the part that the longer match 'should have' -- so most people aren't aware of it.

like image 20
LHMathies Avatar answered Oct 06 '22 00:10

LHMathies