regex.sub() gives different results to re.sub()

Tags: python, regex, python-3.x

I work with Czech accented text in Python 3.4.

Calling re.sub() to substitute by regex in an accented sentence works, but compiling the same regex with re.compile() and then calling regex.sub() on it leaves the text unchanged.

Here is a case where I pass the same arguments to re.sub() and regex.sub():

import re

pattern = r'(?<!\*)(Poplatn[ií]\w+ da[nň]\w+)'
flags = re.I|re.L
compiled = re.compile(pattern, flags)
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**' # wrap 1st matching group in double stars

print(re.sub(pattern, mark, text, flags))
# outputs: **Poplatníkem daně** z pozemků je vlastník pozemku
# substitution works

print(compiled.sub(mark, text))
# outputs: Poplatníkem daně z pozemků je vlastník pozemku
# substitution fails

I believe the accents are the cause, because for a non-accented sentence re.sub() and regex.sub() work identically.

Still, it looks like a bug to me, because passing the same arguments should not return different results. The topic is complicated by different platforms and locales, so it may not be reproducible on your system. Here is a screenshot of my console.

[Screenshot: Python console]

Do you see any fault in my code, or should I report it as a bug?


asked Apr 13 '15 09:04

Rudolf Gröhling

2 Answers

As Padraic Cunningham figured out, this is not actually a bug.

However, it is related to a bug you didn't run into, and to a flag you probably shouldn't be using, so I'll leave my earlier answer below, even though his is the right answer to your problem.

There's a recent-ish change (somewhere between 3.4.1 and 3.4.3, and between 2.7.3 and 2.7.8) that affects this. Before that change, that pattern wouldn't even compile; it raised an OverflowError.

More importantly, why are you using re.L? The re.L mechanism does not mean "use the Unicode rules for my locale", it means "use some unspecified non-Unicode rules that only really make sense for Latin-1-derived locales and may not work right on Windows". Or, as the docs put it:

Make \w, \W, \b, \B, \s and \S dependent on the current locale. The use of this flag is discouraged as the locale mechanism is very unreliable, and it only handles one “culture” at a time anyway; you should use Unicode matching instead, which is the default in Python 3 for Unicode (str) patterns.

See bug #22407 and the linked python-dev thread for some recent discussion of this.
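To illustrate the "Unicode matching is the default" part of that quote, here is a minimal sketch. The sample text is borrowed from the question; comparing against re.ASCII is my own choice for contrast and is not something the question used.

```python
import re

text = 'Poplatníkem daně z pozemků'

# str patterns follow Unicode rules by default in Python 3,
# so \w matches accented letters without any locale flag.
unicode_words = re.findall(r'\w+', text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters break the words apart.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)  # ['Poplatníkem', 'daně', 'z', 'pozemků']
print(ascii_words)    # ['Poplatn', 'kem', 'dan', 'z', 'pozemk']
```

No locale setup is involved in either call, which is the point: for str patterns the accented letters are handled by the default Unicode rules.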

And if I remove the re.L flag, the code compiles just fine on 3.4.1. (I also get the "right" results on both 3.4.1 and 3.4.3, but that's just a coincidence: the compiled pattern now intentionally omits the screwy flag, and the re.sub() call still accidentally fails to apply it, so they match…)
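A sketch of the question's code with re.L dropped and the flags passed by keyword, assuming a Python 3 interpreter where str patterns default to Unicode matching. The names via_compiled and via_module are mine, introduced only for illustration.

```python
import re

pattern = r'(?<!\*)(Poplatn[ií]\w+ da[nň]\w+)'
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**'  # wrap 1st matching group in double stars
flags = re.I      # re.L dropped; Unicode matching is already the default

compiled = re.compile(pattern, flags)

via_compiled = compiled.sub(mark, text)
via_module = re.sub(pattern, mark, text, flags=flags)  # flags by keyword

print(via_compiled)                # **Poplatníkem daně** z pozemků je vlastník pozemku
print(via_compiled == via_module)  # True
```

Here both calls really do receive the same flags, so agreement between them is no longer a coincidence.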

So, even if this were a bug, there's a good chance it would be closed WONTFIX. The resolution for #22407 was to deprecate re.L for non-bytes patterns in 3.5 and remove it in 3.6, so I doubt anyone's going to care about fixing bugs with it now. (Not to mention that re itself is theoretically going away in favor of regex one of these decades… and IIRC, regex also deprecated the L flag unless you're using a bytes pattern and re-compatible mode.)
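If you want to check how your own interpreter treats the flag, a small probe like the following may help. It is only a sketch: locale_flag_accepted is a hypothetical helper name, and the expectation that str patterns are rejected with ValueError assumes Python 3.6 or later, in line with the resolution described above.

```python
import re

def locale_flag_accepted(pattern):
    """Return True if re.LOCALE can be combined with this pattern."""
    try:
        re.compile(pattern, re.LOCALE)
    except ValueError:
        # Assumed behaviour on 3.6+: LOCALE is refused for str patterns.
        return False
    return True

str_ok = locale_flag_accepted(r'\w+')     # str pattern
bytes_ok = locale_flag_accepted(br'\w+')  # bytes pattern

print(str_ok, bytes_ok)
```

On an older interpreter such as 3.4 the str case would be expected to compile, so treat the result as a description of your interpreter rather than a universal rule.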


answered Sep 21 '22 11:09

abarnert

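The last argument to re.compile is flags. If you actually pass flags=flags to re.sub, you will see the same behaviour:

# pattern and flags are the ones defined in the question
compiled = re.compile(pattern, flags)
print(compiled)
text = 'Poplatníkem daně z pozemků je vlastník pozemku'
mark = r'**\1**' # wrap 1st matching group in double stars

r = re.sub(pattern, mark, text, flags=flags) # flags passed by keyword
print(r) # same result as compiled.sub(mark, text)

The fourth positional argument to re.sub is count, not flags, so your call passed the flags value as count and applied no flags at all. That is why you see the difference.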

re.sub(pattern, repl, string, count=0, flags=0)

re.compile(pattern, flags=0)
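The effect of those signatures can be seen in isolation with a toy string. This is a minimal sketch with made-up inputs, not the question's data. Newer Python versions may emit a DeprecationWarning when count is passed positionally, so the sketch silences that category around the positional calls.

```python
import re
import warnings

text = 'aaaa'

with warnings.catch_warnings():
    warnings.simplefilter('ignore', DeprecationWarning)
    # The fourth positional argument is count: at most 2 replacements.
    limited = re.sub('a', 'X', text, 2)
    # re.I has the integer value 2, so here it is read as count=2
    # and no flag is applied; 'A' is matched case-sensitively.
    flag_as_count = re.sub('A', 'X', text, re.I)

# Passed by keyword, re.I is applied as a flag.
flag_by_keyword = re.sub('A', 'X', text, flags=re.I)

print(limited)          # XXaa
print(flag_as_count)    # aaaa
print(flag_by_keyword)  # XXXX
```

In the question, re.I|re.L evaluates to 6, so the re.sub() call ran with count=6 and no flags, while the compiled pattern ran with both flags.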

answered Sep 24 '22 11:09

Padraic Cunningham
