
Weird behavior of re.sub with utf-8 strings

Could anyone explain this strange behavior to me? I would expect both replace methods to either work or fail together. Is it just me, or does anyone else find this inconsistent?

>>> u'è'.replace("\xe0","")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
>>> import re
>>> re.sub(u'è','\xe0','',flags=re.UNICODE)
''

(Please note that I'm not asking for an explanation of why u'è'.replace("\xe0","") raises an error!)

luke14free asked May 07 '12 15:05


2 Answers

From the Python Unicode documentation:

the arguments to these methods can be Unicode strings or 8-bit strings. 8-bit strings will be converted to Unicode before carrying out the operation; Python’s default ASCII encoding will be used, so characters greater than 127 will cause an exception

From the re module documentation:

This module provides regular expression matching operations similar to those found in Perl. Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings.

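The re module accepts both kinds of strings, and in your call it never attempts to convert the 8-bit string to Unicode, so no error is raised. Note that the re.UNICODE flag only changes how classes such as \w, \b, \d and \s match; it does not trigger any string conversion. That is why the two calls do not behave consistently.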
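One detail that may be worth checking here, as a side note: the signature is re.sub(pattern, repl, string), so in the question's call the text being searched is the empty string '' and '\xe0' is only the replacement. With nothing to match, the replacement is never combined with a Unicode string. The sketch below uses Unicode literals throughout so that it runs on both Python 2 and Python 3; the variable names are illustrative only.

```python
import re

# Argument order is re.sub(pattern, repl, string).
# Here the string being searched is empty, so nothing can match
# and the replacement u'\xe0' is never used.
as_in_question = re.sub(u'\xe8', u'\xe0', u'', flags=re.UNICODE)

# Same arguments as the str.replace call: look for u'\xe0' in u'\xe8'
# and replace it with the empty string. There is no match here either,
# so the input comes back unchanged.
as_in_replace = re.sub(u'\xe0', u'', u'\xe8', flags=re.UNICODE)

print(repr(as_in_question))
print(repr(as_in_replace))
```

Written this way, the two calls are only comparable when the arguments play the same roles: the searched text goes last in re.sub, whereas it is the object the method is called on in str.replace.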

subiet answered Oct 20 '22 08:10


Python 2.x handles encodings somewhat unnaturally, relying on implicit conversion. When the user does not take care of the conversion, it will try to mix Unicode and non-Unicode strings on its own. In the end, that doesn't solve the problem: the developer has to deal with encoding from the get-go. Python 2 just makes things less explicit and a bit less obvious.

>>> u'è'.replace(u"\xe0", u"")
u'\xe8'

That's your original example, except that I explicitly told Python that all the strings are Unicode. If you don't, Python will try to convert them, and because the default encoding in Python 2 is ASCII, that fails with your example: the byte 0xe0 is outside the ASCII range.
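The implicit conversion can be made visible by doing the decode by hand. This is a minimal sketch, assuming the byte 0xe0 was meant as Latin-1 (the question does not say which encoding was intended); it behaves the same on Python 2 and Python 3.

```python
raw = b"\xe0"  # an 8-bit string holding the single byte 0xe0

# What Python 2 attempts implicitly when raw meets a Unicode string:
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False  # 0xe0 is outside the 0-127 ASCII range

# Decoding explicitly with a suitable codec works.
# Latin-1 is an assumption made for this example.
needle = raw.decode("latin-1")  # the character U+00E0
result = u"\xe8".replace(needle, u"")

print(ascii_ok, repr(result))
```

Once both operands are Unicode, replace runs without any hidden decoding step, and the only question left is whether the needle actually occurs in the text.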

Encoding is a tricky subject, but with some good habits (such as converting early and always knowing what type of data the program is handling at a given point), it usually (and I insist, USUALLY) goes well.
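The "early conversion" habit can be sketched as a small helper that turns incoming bytes into text once, at the edge of the program. Both to_text and its default encoding="utf-8" are hypothetical choices made for this illustration, not part of any library.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper; the default encoding is an assumption and
    should match whatever the input source really uses.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Decode once, on input, then work only with text afterwards.
incoming = [b"caf\xc3\xa9", u"d\xe9j\xe0"]
texts = [to_text(item) for item in incoming]
cleaned = [t.replace(u"\xe0", u"") for t in texts]

print(cleaned)
```

After the first step every value in texts is Unicode, so later string operations never have to guess an encoding.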

Hope that helps!

vincent-lg answered Oct 20 '22 10:10