
regex.sub() gives different results to re.sub()

I work with Czech accented text in Python 3.4.

Calling re.sub() to perform substitution by regex on an accented sentence works well, but using a regex compiled with re.compile() and then calling regex.sub() fails.

Here is a case where I pass the same arguments to re.sub() and regex.sub():

import re

pattern = r'(?<!\*)(Poplatn[ií]\w+ da[nň]\w+)'
flags = re.I|re.L
compiled = re.compile(pattern, flags)
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**' # wrap 1st matching group in double stars

print(re.sub(pattern, mark, text, flags))
# outputs: **Poplatníkem daně** z pozemků je vlastník pozemku
# substitution works

print(compiled.sub(mark, text))
# outputs: Poplatníkem daně z pozemků je vlastník pozemku
# substitution fails

I believe that the reason is accents, because for a non-accented sentence re.sub() and regex.sub() work identically.

But it looks like a bug to me, because passing the same arguments returns different results, which should not happen. The topic is complicated by different platforms and locales, so it may not be reproducible on your system. Here is a screenshot of my console.

[Screenshot: Python console]

Do you see any fault in my code, or should I report it as a bug?

Rudolf Gröhling asked Apr 13 '15 09:04



2 Answers

As Padraic Cunningham figured out, this is not actually a bug.

However, it is related to a bug which you didn't run into, and to you using a flag you probably shouldn't be using, so I'll leave my earlier answer below, even though his is the right answer to your problem.


There's a recent-ish change (somewhere between 3.4.1 and 3.4.3, and between 2.7.3 and 2.7.8) that affects this. Before that change, you couldn't even compile that pattern without raising an OverflowError.

More importantly, why are you using re.L? The re.L mechanism does not mean "use the Unicode rules for my locale", it means "use some unspecified non-Unicode rules that only really make sense for Latin-1-derived locales and may not work right on Windows". Or, as the docs put it:

Make \w, \W, \b, \B, \s and \S dependent on the current locale. The use of this flag is discouraged as the locale mechanism is very unreliable, and it only handles one “culture” at a time anyway; you should use Unicode matching instead, which is the default in Python 3 for Unicode (str) patterns.

See bug #22407 and the linked python-dev thread for some recent discussion of this.

And if I remove the re.L flag, the code now compiles just fine on 3.4.1. (I also get the "right" results on both 3.4.1 and 3.4.3, but that's just a coincidence; I'm now intentionally not passing the screwy flag and screwing it up in the first version, and still accidentally not passing the screwy flag and screwing it up in the second, so they match…)
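A minimal sketch of the flag-free version, assuming a current Python 3 where str patterns use Unicode matching by default. The re.ASCII variant is only a stand-in to show what a non-Unicode \w does to this pattern; it is not the same thing as re.L, whose behaviour depends on the platform locale.

```python
import re

pattern = r'(?<!\*)(Poplatn[ií]\w+ da[nň]\w+)'
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**'  # wrap 1st matching group in double stars

# Default Unicode matching: \w matches accented letters such as 'ě'.
unicode_re = re.compile(pattern, re.I)
unicode_result = unicode_re.sub(mark, text)
print(unicode_result)

# ASCII-only \w (illustrative stand-in for a non-Unicode \w):
# the \w+ after 'da[nň]' cannot match 'ě', so nothing is replaced.
ascii_re = re.compile(pattern, re.I | re.A)
ascii_result = ascii_re.sub(mark, text)
print(ascii_result)
```

With Unicode matching the phrase gets wrapped in double stars; with the ASCII-only \w the text comes back unchanged, which is the same symptom as in the question.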

So, even if this were a bug, there's a good chance it would be closed WONTFIX. The resolution for #22407 was to deprecate re.L for non-bytes patterns in 3.5 and remove it in 3.6, so I doubt anyone's going to care about fixing bugs with it now. (Not to mention that re itself is theoretically going away in favor of regex one of these decades… and IIRC, regex also deprecated the L flag unless you're using a bytes pattern and re-compatible mode.)
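For anyone trying this on Python 3.6 or later, here is a small sketch of what that removal looks like in practice. It assumes a modern interpreter; on 3.4 the first compile would succeed instead of raising.

```python
import re

pattern = r'(?<!\*)(Poplatn[ií]\w+ da[nň]\w+)'

# On Python 3.6+, re.L combined with a str pattern is rejected outright.
try:
    re.compile(pattern, re.I | re.L)
    outcome = "compiled"
except ValueError as exc:
    outcome = "rejected: {}".format(exc)
print(outcome)

# re.L is still accepted for bytes patterns.
bytes_re = re.compile(rb'\w+', re.L)
print(bool(bytes_re.flags & re.L))
```

So on a current Python the question's compiled version fails loudly at compile time rather than silently returning the text unchanged.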

abarnert answered Sep 21 '22 11:09


The last argument to re.compile is flags. If you actually pass flags=flags to re.sub, you will see the same behaviour:

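# pattern and flags are the same as in the question
compiled = re.compile(pattern, flags)
print(compiled)
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**' # wrap 1st matching group in double stars

# pass flags by keyword so they are not taken as count
r = re.sub(pattern, mark, text, flags=flags)
print(r)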

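The fourth positional argument to re.sub is count, not flags. In your call the flags value was therefore interpreted as a maximum number of replacements and no flags were applied at all, which is why you see the difference.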

re.sub(pattern, repl, string, count=0, flags=0)

re.compile(pattern, flags=0)
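To make the signature difference concrete, here is a small sketch using a made-up sample string (not from the question) and only re.I, so it runs on current Python versions. Newer interpreters may warn about passing count positionally, so the sketch silences DeprecationWarning for that one call.

```python
import re
import warnings

text = "AaAaAa"  # hypothetical sample input

# Positional 4th argument: re.I (integer value 2) is taken as count=2,
# and flags stays 0, so matching is case-sensitive.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    positional = re.sub("a", "x", text, re.I)

# Keyword argument: the flag is actually applied.
keyword = re.sub("a", "x", text, flags=re.I)

# Compiled pattern: the 2nd argument of re.compile really is flags.
compiled = re.compile("a", re.I).sub("x", text)

print(positional)  # AxAxAa
print(keyword)     # xxxxxx
print(compiled)    # xxxxxx
```

The positional call replaces only the first two lowercase letters, while the keyword and compiled calls agree with each other.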

Padraic Cunningham answered Sep 24 '22 11:09