Wrong charset after re.sub()

I have this code:

import chardet, re    

content = "Бланк свидетельства о допуске."
print content
print chardet.detect(content)
content = re.sub(u"(?i)[^-0-9a-zа-яё«»\&\;\/\<\>\.,\s\(\)\*:!\?]", "", content)
print content
print chardet.detect(content)

And this is the output:

Бланк свидетельства о допуске.
{'confidence': 0.99, 'encoding': 'utf-8'}
� �  .
{'confidence': 0.5, 'encoding': 'windows-1252'}

What am I doing wrong? How can I get a UTF-8 string after re.sub()? (Python 2.7, # coding: utf-8, file saved as UTF-8, IDE: PyCharm.)

Thanks.

Patrick Burns asked Dec 26 '22 05:12


2 Answers

This is what (I think) you're trying to achieve (I've simplified the regexp for clarity):

#coding=utf8
import re    
content = u"Бланк XYZ свидетельства о ???допуске."
content = re.sub(u"(?iu)[^а-яё]", ".", content)
print content.encode('utf8') # Бланк.....свидетельства.о....допуске.

Note the important points:

  • the subject is unicode
  • the expression is unicode
  • the expression uses the unicode flag (?u) to make case folding work.
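
The three points above can be folded into one small helper. This is only a sketch: the name strip_disallowed and the reduced allowed set are assumptions for illustration, not the full class from the question. It decodes byte input first, then applies a unicode pattern compiled with the unicode flag, and it is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Simplified allowed set (letters, digits, a little punctuation); an assumption,
# not the full character class from the question.
DISALLOWED = re.compile(r"[^-0-9a-zа-яё«».,\s]", re.IGNORECASE | re.UNICODE)


def strip_disallowed(data, encoding="utf-8"):
    """Decode byte input first, then drop every character outside the allowed set."""
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return DISALLOWED.sub("", data)


raw = "Бланк № 5: свидетельства о допуске.".encode("utf-8")
print(strip_disallowed(raw))
```

The helper always returns unicode text; encode it again (for example with .encode('utf8')) only at the point where bytes are needed, such as when writing to a file.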

Also, for serious unicode work I recommend the regex module, which provides excellent and almost complete unicode support. Consider:

# drop everything except Cyrillic and spaces 
import regex
content = regex.sub(u'[^\p{Cyrillic}\p{Zs}]', '', content) 

Although it's documented that re.UNICODE only alters \w and friends, in my tests it also affects case folding (re.IGNORECASE):

Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> src = u'Σσ Φφ Γγ'
>>> src
u'\u03a3\u03c3 \u03a6\u03c6 \u0393\u03b3'
>>> re.sub(ur'(?i)[α-ώ]', '-', src)
u'\u03a3- \u03a6- \u0393-'
>>> re.sub(ur'(?iu)[α-ώ]', '-', src)
u'-- -- --'

So that's either an undocumented feature or a documentation problem.

georg answered Dec 28 '22 23:12


Your input is a UTF-8 encoded byte string:

>>> content
'\xd0\x91\xd0\xbb\xd0\xb0\xd0\xbd\xd0\xba \xd1\x81\xd0\xb2\xd0\xb8\xd0\xb4\xd0\xb5\xd1\x82\xd0\xb5\xd0\xbb\xd1\x8c\xd1\x81\xd1\x82\xd0\xb2\xd0\xb0 \xd0\xbe \xd0\xb4\xd0\xbe\xd0\xbf\xd1\x83\xd1\x81\xd0\xba\xd0\xb5.'

But you are using a unicode regular expression, and it is matched directly against the raw bytes of your UTF-8 input, so every byte is treated as a separate character.

Of all those bytes, only the spaces, the full stop and the \xbb byte (read as the » character) survive. Every other byte is deleted, because it matches the negated character class [^...]: it is not one of the characters that class allows.
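
To see which bytes survive, here is a rough simulation. It decodes the bytes as Latin-1, which yields one code point per byte and approximates how Python 2 treats a byte string when the pattern is unicode. The character class is the one from the question with the redundant backslashes dropped; the variable names are mine.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Same class as in the question, minus the unnecessary escapes.
PATTERN = r"(?i)[^-0-9a-zа-яё«»&;/<>.,\s()*:!?]"

raw = "Бланк свидетельства о допуске.".encode("utf-8")

# One code point per byte: an approximation of the byte-wise matching.
as_code_points = raw.decode("latin-1")
survivors = re.sub(PATTERN, "", as_code_points)
print(repr(survivors))

# The same pattern applied to properly decoded text sees characters, not bytes.
decoded = raw.decode("utf-8")
print(len(raw), len(decoded))
```

The two surviving \xbb bytes are the second halves of the two л characters (\xd0\xbb), which is why the output in the question shows two replacement characters.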

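Using Unicode properly (by decoding content to unicode first) makes the pattern match characters instead of bytes. Note that the capital Б is still dropped below, because the pattern lacks the (?u) flag and so (?i) does not case-fold Cyrillic letters: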

>>> re.sub(u"(?i)[^-0-9a-zа-яё«»\&\;\/\<\>\.,\s\(\)\*:!\?]", "", content.decode('utf8'))
u'\u043b\u0430\u043d\u043a \u0441\u0432\u0438\u0434\u0435\u0442\u0435\u043b\u044c\u0441\u0442\u0432\u0430 \u043e \u0434\u043e\u043f\u0443\u0441\u043a\u0435.'
>>> print re.sub(u"(?i)[^-0-9a-zа-яё«»\&\;\/\<\>\.,\s\(\)\*:!\?]", "", content.decode('utf8'))
ланк свидетельства о допуске.

The alternative would be to use a raw byte string for the regular expression and match byte combinations. Working out which UTF-8 bytes and ranges are valid is very, very tricky. You'll need to fully understand how UTF-8 encodes characters as multiple bytes, then translate your negative character class into a set of positive matches that let through the same byte combinations. This is not for the faint of heart.
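
To give a feel for what that involves, here is a minimal sketch of the byte-level approach for a reduced allowed set only (ASCII letters and digits, a little punctuation, the Russian alphabet and the « » quotes). The byte ranges are worked out by hand from the UTF-8 encoding of those characters, the helper name is made up, and the sketch assumes the input is valid UTF-8.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import re

# Positive match: runs of allowed characters, spelled out as UTF-8 byte sequences.
ALLOWED_BYTES = re.compile(
    br"(?:[-0-9A-Za-z.,\s]"        # ASCII subset
    br"|\xc2[\xab\xbb]"            # « (U+00AB) and » (U+00BB)
    br"|\xd0[\x81\x90-\xbf]"       # Ё (U+0401), А-Я and а-п (U+0410-U+043F)
    br"|\xd1[\x80-\x8f\x91])+"     # р-я (U+0440-U+044F) and ё (U+0451)
)


def keep_allowed_bytes(raw):
    """Keep only the byte sequences that encode an allowed character."""
    return b"".join(ALLOWED_BYTES.findall(raw))


raw = "Бланк № свидетельства о допуске.".encode("utf-8")
print(keep_allowed_bytes(raw).decode("utf-8"))
```

Even this small set needs four hand-computed ranges, and every character added to the allowed set means another one, so decoding to unicode first remains the simpler route.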

Martijn Pieters answered Dec 28 '22 22:12