I apologize if this is published somewhere, but my cursory searching didn't find anything.
While doing some Python programming I noticed that the following command:
re.sub("a*((ab)*)b", r"\1", "aabb")
returns the empty string. But an equivalent command in sed:
echo "aabb" | sed "s/a*\(\(ab\)*\)b/\1/"
returns ab
.
It makes sense to me that the "a*" directive at the beginning of the python regex would match both a
's, causing "(ab)*" to match zero times, but I have no idea how sed comes up with ab
. Does anybody know what the difference is between the two regex engines that causes this? I believe they both match stars greedily by default, but it occurred to me that sed might match from the right rather than the left. Any insight would be greatly appreciated.
A regular expression is a string that can be used to describe several sequences of characters. Regular expressions are used by several different Unix commands, including ed, sed, awk, grep, and to a more limited extent, vi.
As Avinash Raj has pointed out, sed uses basic regular expression (BRE) syntax by default, (which requires ( , ) , { , } to be preceded by \ to activate its special meaning), and -r option switches over to extended regular expression (ERE) syntax, which treats ( , ) , { , } as special without preceding \ .
There is a difference between the use of both functions. Both return the first match of a substring found in the string, but re. match() searches only from the beginning of the string and return match object if found.
Both Python and sed are greedy by default but... Python regex tries to evaluate from left to right in all circumstances, despite of it must do eventually a backtrace to the previous state if the branch being tried can not continue by matching. Sed regex on the contrary are optimized before evaluating in order to prevent an unnecessary backtrace, by rewriting the regex to a more deterministic form. Therefore the combined optional pattern "aab" is probably tested before the plain "a" because the most specific possible string is tried first.
Python pattern matches the string "aabb" twice as "aab" + "b" (marked between "<>")
>>> re.sub("a*((ab)*)b", r"<\1>", "aabb")
'<><>'
while sed matches the whole "aabb" by one substitution:
$ echo "aabb" | sed "s/a*\(\(ab\)*\)b/<\1>/"
<ab>
Python regex backtrace algorithm is explained good in regex howto - Repeating Things in two paragraphs introduced by words "A step-by-step example...". It does IMO exactly what is described regex docs: "As the target string is scanned, REs separated by '|' are tried from left to right."
Demonstration
The order of "(|a|aa)" btw. "(aa|a|)" is respected by Python
>>> re.sub("(?:|a|aa)((ab)*)b", r"<\1>", "aabb")
'<ab>'
>>> re.sub("(?:aa|a|)((ab)*)b", r"<\1>", "aabb")
'<><>'
but this order is ignored by sed because sed optimizes regular expressions. Matching "aab" + "b" can be reproduced removing "a" option from the pattern.
$ echo "aabb" | sed "s/\(\|a\|aa\)\(\(ab\)*\)b/<\2>/g"
<ab>
$ echo "aabb" | sed "s/\(aa\|a\|\)\(\(ab\)*\)b/<\2>/g"
<ab>
$ echo "aabb" | sed "s/\(aa\|\)\(\(ab\)*\)b/<\2>/g"
<><>
Edit: I removed everything about DFA/NFA because I can not prove it from current texts.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With