Inconsistency between sed and python regular expressions

Tags: python, regex, sed

I apologize if this is published somewhere, but my cursory searching didn't find anything.

While doing some Python programming I noticed that the following command:

re.sub("a*((ab)*)b", r"\1", "aabb")

returns the empty string. But an equivalent command in sed:

echo "aabb" | sed "s/a*\(\(ab\)*\)b/\1/"

returns ab.

It makes sense to me that the "a*" at the beginning of the Python regex would match both a's, causing "(ab)*" to match zero times, but I have no idea how sed comes up with ab. Does anybody know what difference between the two regex engines causes this? I believe they both match stars greedily by default, but it occurred to me that sed might match from the right rather than from the left. Any insight would be greatly appreciated.

asked Aug 23 '12 21:08

maths

1 Answer

Both Python and sed are greedy by default, but they differ in how they search. Python's regex engine always evaluates the pattern from left to right, and it backtracks to a previous state only when the branch being tried cannot continue to match. Sed regexes, on the contrary, are optimized before evaluation: the regex is rewritten into a more deterministic form in order to avoid unnecessary backtracking. Therefore the combined optional pattern "aab" is probably tested before the plain "a", because the most specific possible string is tried first. The visible effect is that sed returns the longest match starting at the leftmost position, which is what POSIX requires of a regular expression match.

Python matches the string "aabb" in two pieces, "aab" and then "b" (each replacement is marked with "<>"):

>>> re.sub("a*((ab)*)b", r"<\1>", "aabb")
'<><>'

while sed matches the whole "aabb" in a single substitution:

$ echo "aabb" | sed "s/a*\(\(ab\)*\)b/<\1>/"
<ab>
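To see where the two Python matches start and end, the spans can be listed with re.finditer. This is only a small sketch for inspection; the variable names are mine and nothing here goes beyond the standard re module.

```python
import re

pattern = re.compile(r"a*((ab)*)b")

# Collect (matched text, span, group 1) for every non-overlapping match.
matches = [(m.group(0), m.span(), m.group(1)) for m in pattern.finditer("aabb")]

for text, span, group1 in matches:
    print(f"matched {text!r} at {span}, group 1 = {group1!r}")
```

The first match covers "aab" with an empty group 1, and the search then resumes at the last "b", which matches on its own, again with an empty group 1.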

Python's backtracking algorithm is explained well in the Regular Expression HOWTO, section "Repeating Things", in the two paragraphs introduced by the words "A step-by-step example...". In my opinion it does exactly what the re documentation describes: "As the target string is scanned, REs separated by '|' are tried from left to right."

Demonstration

Python respects the order of the alternatives in "(|a|aa)" versus "(aa|a|)":

>>> re.sub("(?:|a|aa)((ab)*)b", r"<\1>", "aabb")
'<ab>'
>>> re.sub("(?:aa|a|)((ab)*)b", r"<\1>", "aabb")
'<><>'

but sed ignores this order because it optimizes regular expressions. The "aab" + "b" match can be reproduced in sed by removing the "a" alternative from the pattern.

$ echo "aabb" | sed "s/\(\|a\|aa\)\(\(ab\)*\)b/<\2>/g"
<ab>
$ echo "aabb" | sed "s/\(aa\|a\|\)\(\(ab\)*\)b/<\2>/g"
<ab>
$ echo "aabb" | sed "s/\(aa\|\)\(\(ab\)*\)b/<\2>/g"
<><>
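The sed results above are what one would expect if sed picks the longest match at the leftmost starting position. As a rough sketch of that reading, the hypothetical helper below (sub_leftmost_longest is my own name, not part of re or sed) emulates it in Python by brute force: at each position it tries the longest candidate substring first with fullmatch. It assumes short inputs and treats empty matches as "no match", so it is an illustration rather than a replacement for sed.

```python
import re


def sub_leftmost_longest(pattern, repl, text):
    """Like re.sub, but take the longest match at each leftmost start.

    Hypothetical helper: brute force, intended for short strings only.
    Empty matches are skipped rather than replaced.
    """
    regex = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(text):
        match = None
        # Try the longest candidate substring first, then shorter ones.
        for end in range(len(text), pos - 1, -1):
            match = regex.fullmatch(text, pos, end)
            if match:
                break
        if match and match.end() > pos:
            out.append(match.expand(repl))
            pos = match.end()
        else:
            # No usable match here: copy one character and move on.
            out.append(text[pos:pos + 1])
            pos += 1
    return "".join(out)


print(sub_leftmost_longest(r"a*((ab)*)b", r"<\1>", "aabb"))
print(sub_leftmost_longest(r"(?:aa|a|)((ab)*)b", r"<\1>", "aabb"))
print(sub_leftmost_longest(r"(?:aa|)((ab)*)b", r"<\1>", "aabb"))
```

With this helper the order of the alternatives no longer matters, and dropping the "a" alternative splits the result into two replacements, which mirrors the three sed runs.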

Edit: I removed everything about DFA/NFA because I cannot prove it from the current documentation.

answered Sep 23 '22 22:09

hynekcer
