What is the reason behind the advice that the substrings in regex should be ordered based on length?

Question

longest first

>>> p = re.compile('supermanutd|supermanu|superman|superm|super')

shortest first

>>> p = re.compile('super|superm|superman|supermanu|supermanutd')

Why is the longest first regex preferred?

MBO · Accepted Answer

Alternatives in Regexes are tested in order you provide, so if first branch matches, then Rx doesn't check other branches. This doesn't matter if you only need to test for match, but if you want to extract text based on match, then it matters.

You only need to sort by length when your shorter strings are substrings of longer ones. For example when you have text:

supermanutd
supermanu
superman
superm

then with your first Rx you'll get:

>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']

but with second Rx:

>>> regex.findall(string)
[u'super', u'super', u'super', u'super', u'super']

Test your regexes with http://www.pythonregex.com/

LHMathies · Answer

As @MBO says, alternatives are tested in the order they are written, and once one of them matches, the RE engine goes on to what comes after.
This behaviour is common to Perl-like RE engines, and ultimately goes back to the 1985 Bell Labs design of the RE library for Edition 8 Unix.
Note that POSIX 2 (from 1991) has another definition, insisting on the leftmost longest match for the whole RE and subject to that, for each subexpression in turn (in lexical order). In POSIX 2, order of alternatives does not matter.

However, the difference in behaviour is often: irrelevant (if you're just testing), masked by backtracking (if the shorter match causes the rest of the RE to fail), or compensated by the rest of the RE matching the part that the longer match 'should have' -- so most people aren't aware of it.

What is the reason behind the advice that the substrings in regex should be ordered based on length?

Tags:

python

regex

Frankie Ribery

2 Answers

MBO

LHMathies

Recent Activity

Donate For Us

What is the reason behind the advice that the substrings in regex should be ordered based on length?

Tags:

python

regex

Frankie Ribery

2 Answers

MBO

LHMathies

Related questions

Recent Activity

Donate For Us