I work with Czech accented text in Python 3.4.
Calling re.sub() to perform a regex substitution on an accented sentence works well, but compiling the regex with re.compile() and then calling regex.sub() fails.
Here is the case, where I pass the same arguments to re.sub() and regex.sub():
import re
pattern = r'(?<!\*)(Poplatn[ií]\w+ da[nň]\w+)'
flags = re.I|re.L
compiled = re.compile(pattern, flags)
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**' # wrap 1st matching group in double stars
print(re.sub(pattern, mark, text, flags))
# outputs: **Poplatníkem daně** z pozemků je vlastník pozemku
# substitution works
print(compiled.sub(mark, text))
# outputs: Poplatníkem daně z pozemků je vlastník pozemku
# substitution fails
I believe the reason is the accents, because for a non-accented sentence re.sub() and regex.sub() work identically.
But it looks like a bug to me, because passing the same arguments returns different results, which should not happen. This topic is complicated by different platforms and locales, so it may not be reproducible on your system. Here is a screenshot of my console.
Do you see any fault in my code, or should I report it as a bug?
The re.sub() function performs a substitution: it returns a copy of the string in which the matches of the pattern have been replaced.
To replace text that matches a regular expression (regex) rather than an exact string, use sub() from the re module. In re.sub(), specify the regex pattern as the first argument, the replacement string as the second, and the string to be processed as the third. The optional fourth and fifth arguments are count and flags.
As Padraic Cunningham figured out, this is not actually a bug.
However, it is related to a bug which you didn't run into, and to you using a flag you probably shouldn't be using, so I'll leave my earlier answer below, even though his is the right answer to your problem.
There's a recent-ish change (somewhere between 3.4.1 and 3.4.3, and between 2.7.3 and 2.7.8) that affects this. Before that change, you couldn't even compile that pattern without raising an OverflowError.
More importantly, why are you using re.L? The re.L mechanism does not mean "use the Unicode rules for my locale", it means "use some unspecified non-Unicode rules that only really make sense for Latin-1-derived locales and may not work right on Windows". Or, as the docs put it:
Make \w, \W, \b, \B, \s and \S dependent on the current locale. The use of this flag is discouraged as the locale mechanism is very unreliable, and it only handles one “culture” at a time anyway; you should use Unicode matching instead, which is the default in Python 3 for Unicode (str) patterns.
See bug #22407 and the linked python-dev thread for some recent discussion of this.
And if I remove the re.L flag, the code now compiles just fine on 3.4.1. (I also get the "right" results on both 3.4.1 and 3.4.3, but that's just a coincidence: the compiled version now intentionally leaves out the screwy flag, and the re.sub() version still accidentally leaves it out, so they match…)
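As a minimal sketch of the point about dropping re.L, the snippet below reuses the pattern, text, and replacement from the question and keeps only re.I. The names via_compiled and via_module are illustrative, and it assumes a Python 3 interpreter where str patterns use Unicode matching by default, so \w is expected to cover the accented Czech letters without any locale flag.

```python
import re

# Pattern, text and replacement taken from the question.
pattern = r'(?<!\*)(Poplatn[ií]\w+ da[nň]\w+)'
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**'  # wrap the first group in double stars

# Only re.I is kept; re.L is dropped, so str patterns use Unicode rules.
compiled = re.compile(pattern, re.I)
via_compiled = compiled.sub(mark, text)

# Passing the flag by keyword makes the module-level call equivalent.
via_module = re.sub(pattern, mark, text, flags=re.I)

print(via_compiled)
print(via_module == via_compiled)
```

With the flag passed the same way in both places, the compiled and module-level calls should agree.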
So, even if this were a bug, there's a good chance it would be closed WONTFIX. The resolution for #22407 was to deprecate re.L for non-bytes patterns in 3.5 and remove it in 3.6, so I doubt anyone's going to care about fixing bugs with it now. (Not to mention that re itself is theoretically going away in favor of regex one of these decades… and IIRC, regex also deprecated the L flag unless you're using a bytes pattern and re-compatible mode.)
The last argument to re.compile is flags; if you actually use flags=flags in the re.sub call, you will see the same behaviour:
compiled = re.compile(pattern, flags)
print(compiled)
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**' # wrap 1st matching group in double stars
r = re.sub(pattern, mark, text, flags=flags) # flags passed by keyword, not as count
print(r)
The fourth argument to re.sub is count, so that is why you see the difference:
re.sub(pattern, repl, string, count=0, flags=0)
re.compile(pattern, flags=0)
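The effect of those two signatures can be illustrated with a toy input (the string 'aAaAaA' and the variable names are made up for this sketch). Since re.I has the integer value 2, passing it as the fourth positional argument is read as count=2 with no flags at all. Newer Python versions may warn about, or eventually reject, a positional count, so the sketch silences DeprecationWarning for that one call.

```python
import re
import warnings

text = 'aAaAaA'

# Fourth positional argument is count, so re.I (== 2) means "replace at
# most 2 matches" and the match stays case-sensitive.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', DeprecationWarning)
    positional = re.sub('a', '-', text, re.I)

# Passing the flag by keyword applies it as a flag.
keyword = re.sub('a', '-', text, flags=re.I)

# re.compile takes flags as its second argument, so this is equivalent.
compiled = re.compile('a', re.I).sub('-', text)

print(int(re.I))    # 2
print(positional)   # -A-AaA
print(keyword)      # ------
print(compiled)     # ------
```

In the question, re.I|re.L was likewise taken as a count, so the module-level call ran with no flags while the compiled pattern ran with both.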