longest first
>>> p = re.compile('supermanutd|supermanu|superman|superm|super')
shortest first
>>> p = re.compile('super|superm|superman|supermanu|supermanutd')
Why is the longest first regex preferred?
Alternatives in Regexes are tested in order you provide, so if first branch matches, then Rx doesn't check other branches. This doesn't matter if you only need to test for match, but if you want to extract text based on match, then it matters.
You only need to sort by length when your shorter strings are substrings of longer ones. For example when you have text:
supermanutd
supermanu
superman
superm
then with your first Rx you'll get:
>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']
but with second Rx:
>>> regex.findall(string)
[u'super', u'super', u'super', u'super', u'super']
Test your regexes with http://www.pythonregex.com/
As @MBO says, alternatives are tested in the order they are written, and once one of them matches, the RE engine goes on to what comes after.
This behaviour is common to Perl-like RE engines, and ultimately goes back to the 1985 Bell Labs design of the RE library for Edition 8 Unix.
Note that POSIX 2 (from 1991) has another definition, insisting on the leftmost longest match for the whole RE and subject to that, for each subexpression in turn (in lexical order). In POSIX 2, order of alternatives does not matter.
However, the difference in behaviour is often: irrelevant (if you're just testing), masked by backtracking (if the shorter match causes the rest of the RE to fail), or compensated by the rest of the RE matching the part that the longer match 'should have' -- so most people aren't aware of it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With