Why does re.sub('.*?', '-', 'abc') return '-a-b-c-' instead of '-------'?

Question

This is the result from Python 2.7.

>>> re.sub('.*?', '-', 'abc')
'-a-b-c-'

I expected the result to be as follows.

>>> re.sub('.*?', '-', 'abc')
'-------'

But it's not. Why?

Veedrac · Accepted Answer

The best explanation of this behaviour I know of is from the regex PyPI package, which is intended to eventually replace re (although it has been this way for a long time now).

Sometimes it’s not clear how zero-width matches should be handled. For example, should .* match 0 characters directly after matching >0 characters?

Most regex implementations follow the lead of Perl (PCRE), but the re module sometimes doesn’t. The Perl behaviour appears to be the most common (and the re module is sometimes definitely wrong), so in version 1 the regex module follows the Perl behaviour, whereas in version 0 it follows the legacy re behaviour.

Examples:
# Version 0 behaviour (like re)
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'

# Version 1 behaviour (like Perl)
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'

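(?V0) and (?V1) set the version flag in the regex. The Version 1 result for .*? is what you expected, and is supposedly what PCRE does. Python's re is somewhat nonstandard here, and was probably kept that way solely because of backwards compatibility concerns. Since Python 3.7, re.sub also replaces empty matches that are adjacent to a previous non-empty match, so on those versions the call in the question returns '-------'. I've found an example of something similar (with re.split).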
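
To make the legacy rule concrete, here is a minimal sketch of how the old scan appears to work, written against the plain re API so that it behaves the same on any Python version. The function name legacy_sub is hypothetical (it is not part of re or regex), the replacement is treated as a literal string, and the logic is my reading of the old behaviour rather than the actual implementation: an empty match found exactly where the previous match ended is ignored, and after any empty match the scan steps forward one character.

```python
import re

def legacy_sub(pattern, repl, string):
    """Sketch of the pre-3.7 re.sub scan; repl is used as a literal string."""
    rx = re.compile(pattern)
    out = []
    copied = 0   # everything before this index is already in `out`
    pos = 0      # where the next search starts
    count = 0    # number of matches seen so far
    while pos <= len(string):
        m = rx.search(string, pos)
        if m is None:
            break
        start, end = m.span()
        out.append(string[copied:start])  # text skipped between matches
        # An empty match right where the previous match ended is ignored.
        if not (start == end == copied and count > 0):
            out.append(repl)
        copied = end
        count += 1
        # After an empty match, step one character forward to avoid looping.
        pos = end + 1 if start == end else end
    out.append(string[copied:])
    return "".join(out)

print(legacy_sub('.*?', '-', 'abc'))   # -a-b-c-
print(legacy_sub('.*', 'x', 'test'))   # x
```

With .*? every match is empty, so each character is stepped over and copied through unchanged, which is where '-a-b-c-' comes from. The Perl-like rule instead retries at the same position and insists on a non-empty match, so each character is consumed and replaced as well.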

Tags: python, regex

Daniel
