Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Inconsistency between sed and python regular expressions

Tags:

python

regex

sed

I apologize if this is published somewhere, but my cursory searching didn't find anything.

While doing some Python programming I noticed that the following command:

re.sub("a*((ab)*)b", r"\1", "aabb")

returns the empty string. But an equivalent command in sed:

echo "aabb" | sed "s/a*\(\(ab\)*\)b/\1/"

returns ab.

It makes sense to me that the "a*" directive at the beginning of the python regex would match both a's, causing "(ab)*" to match zero times, but I have no idea how sed comes up with ab. Does anybody know what the difference is between the two regex engines that causes this? I believe they both match stars greedily by default, but it occurred to me that sed might match from the right rather than the left. Any insight would be greatly appreciated.

like image 737
maths Avatar asked Aug 23 '12 21:08

maths


People also ask

Does sed support regular expressions?

A regular expression is a string that can be used to describe several sequences of characters. Regular expressions are used by several different Unix commands, including ed, sed, awk, grep, and to a more limited extent, vi.

What type of regex does sed use?

As Avinash Raj has pointed out, sed uses basic regular expression (BRE) syntax by default, (which requires ( , ) , { , } to be preceded by \ to activate its special meaning), and -r option switches over to extended regular expression (ERE) syntax, which treats ( , ) , { , } as special without preceding \ .

Is there any difference between re match () and re search () in the Python re module?

There is a difference between the use of both functions. Both return the first match of a substring found in the string, but re. match() searches only from the beginning of the string and return match object if found.


1 Answers

Both Python and sed are greedy by default but... Python regex tries to evaluate from left to right in all circumstances, despite of it must do eventually a backtrace to the previous state if the branch being tried can not continue by matching. Sed regex on the contrary are optimized before evaluating in order to prevent an unnecessary backtrace, by rewriting the regex to a more deterministic form. Therefore the combined optional pattern "aab" is probably tested before the plain "a" because the most specific possible string is tried first.

Python pattern matches the string "aabb" twice as "aab" + "b" (marked between "<>")

>>> re.sub("a*((ab)*)b", r"<\1>", "aabb")
'<><>'

while sed matches the whole "aabb" by one substitution:

$ echo "aabb" | sed "s/a*\(\(ab\)*\)b/<\1>/"
<ab>

Python regex backtrace algorithm is explained good in regex howto - Repeating Things in two paragraphs introduced by words "A step-by-step example...". It does IMO exactly what is described regex docs: "As the target string is scanned, REs separated by '|' are tried from left to right."

Demonstration

The order of "(|a|aa)" btw. "(aa|a|)" is respected by Python

>>> re.sub("(?:|a|aa)((ab)*)b", r"<\1>", "aabb")
'<ab>'
>>> re.sub("(?:aa|a|)((ab)*)b", r"<\1>", "aabb")
'<><>'

but this order is ignored by sed because sed optimizes regular expressions. Matching "aab" + "b" can be reproduced removing "a" option from the pattern.

$ echo "aabb" | sed "s/\(\|a\|aa\)\(\(ab\)*\)b/<\2>/g"
<ab>
$ echo "aabb" | sed "s/\(aa\|a\|\)\(\(ab\)*\)b/<\2>/g"
<ab>
$ echo "aabb" | sed "s/\(aa\|\)\(\(ab\)*\)b/<\2>/g"
<><>

Edit: I removed everything about DFA/NFA because I can not prove it from current texts.

like image 60
hynekcer Avatar answered Sep 23 '22 22:09

hynekcer