I have this code:
import chardet, re
content = "Бланк свидетельства о допуске."
print content
print chardet.detect(content)
content = re.sub(u"(?i)[^-0-9a-zа-яё«»\&\;\/\<\>\.,\s\(\)\*:!\?]", "", content)
print content
print chardet.detect(content)
And the output:
Бланк свидетельства о допуске.
{'confidence': 0.99, 'encoding': 'utf-8'}
� � .
{'confidence': 0.5, 'encoding': 'windows-1252'}
What am I doing wrong? How can I get a UTF-8 string after re.sub()?
(Python 2.7, # coding: utf-8, file in UTF-8, IDE PyCharm.)
Thanks.
This is what (I think) you're trying to achieve (I've simplified the regexp for clarity):
#coding=utf8
import re
content = u"Бланк XYZ свидетельства о ???допуске."
content = re.sub(u"(?iu)[^а-яё]", ".", content)
print content.encode('utf8') # Бланк.....свидетельства.о....допуске.
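Note the important points: both the pattern and the subject are unicode strings (u"..."), the (?u) flag makes case folding work, and the result is encoded to UTF-8 only when printing.
Also, for serious unicode work I recommend the regex module, which provides excellent and almost complete unicode support. Consider: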
# drop everything except Cyrillic and spaces
import regex
content = regex.sub(u'[^\p{Cyrillic}\p{Zs}]', '', content)
Although it's documented that re.UNICODE only alters \w and friends, in my tests it also affects case folding (re.IGNORECASE):
Python 2.7.2+ (default, Oct 4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> src = u'Σσ Φφ Γγ'
>>> src
u'\u03a3\u03c3 \u03a6\u03c6 \u0393\u03b3'
>>> re.sub(ur'(?i)[α-ώ]', '-', src)
u'\u03a3- \u03a6- \u0393-'
>>> re.sub(ur'(?iu)[α-ώ]', '-', src)
u'-- -- --'
So that's either an undocumented feature or a documentation problem.
Your input is UTF-8:
>>> content
'\xd0\x91\xd0\xbb\xd0\xb0\xd0\xbd\xd0\xba \xd1\x81\xd0\xb2\xd0\xb8\xd0\xb4\xd0\xb5\xd1\x82\xd0\xb5\xd0\xbb\xd1\x8c\xd1\x81\xd1\x82\xd0\xb2\xd0\xb0 \xd0\xbe \xd0\xb4\xd0\xbe\xd0\xbf\xd1\x83\xd1\x81\xd0\xba\xd0\xb5.'
But you are using a unicode regular expression. The expression is matched directly against the raw bytes of your UTF-8 input.
Of all those bytes, only the spaces, the full stop and the \xbb byte (interpreted as the » character) survive. Every other byte is deleted, because it is not one of the characters your negated character class [^...] allows, and so it matches the class.
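The byte-level behaviour described above can be reproduced with a short sketch. It is written to run on both Python 2.7 and Python 3. The Latin-1 decoding step is an assumption used to imitate how the unicode pattern sees each byte as a separate character, and the names raw, byte_per_char and survivors are illustrative only.

```python
# -*- coding: utf-8 -*-
import re

text = u"Бланк свидетельства о допуске."
raw = text.encode("utf-8")          # the bytes a plain "..." literal holds in Python 2

# Assumption: decoding as Latin-1 maps every byte to the code point with the
# same value, which imitates matching the unicode pattern byte by byte.
byte_per_char = raw.decode("latin-1")

# Same character class as in the question, without the redundant escapes.
pattern = re.compile(u"(?i)[^-0-9a-zа-яё«»&;/<>.,\\s()*:!?]")
survivors = pattern.sub(u"", byte_per_char)

# Show what is left as raw bytes: only \xbb, spaces and the full stop.
print(repr(survivors.encode("latin-1")))
```

The two \xbb bytes come from the two occurrences of "л" (encoded as \xd0\xbb), which is why the result is no longer valid UTF-8.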
Using Unicode properly (by decoding content to unicode first) works:
>>> re.sub(u"(?i)[^-0-9a-zа-яё«»\&\;\/\<\>\.,\s\(\)\*:!\?]", "", content.decode('utf8'))
u'\u043b\u0430\u043d\u043a \u0441\u0432\u0438\u0434\u0435\u0442\u0435\u043b\u044c\u0441\u0442\u0432\u0430 \u043e \u0434\u043e\u043f\u0443\u0441\u043a\u0435.'
>>> print re.sub(u"(?i)[^-0-9a-zа-яё«»\&\;\/\<\>\.,\s\(\)\*:!\?]", "", content.decode('utf8'))
ланк свидетельства о допуске.
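Note that the capital "Б" is still dropped in the output above, most likely because (?i) without the Unicode flag folds only ASCII case in Python 2. A minimal sketch of the full round trip (decode, substitute with re.UNICODE, encode) might look like this. The helper name clean_utf8, the constant DISALLOWED and the sample input containing "№" are hypothetical and not from the question.

```python
# -*- coding: utf-8 -*-
import re

# Everything NOT in this set is removed. re.UNICODE makes IGNORECASE
# apply to Cyrillic as well (on Python 3 this is already the default for str).
DISALLOWED = re.compile(
    u"[^-0-9a-zа-яё«»&;/<>.,\\s()*:!?]",
    re.IGNORECASE | re.UNICODE,
)

def clean_utf8(raw_bytes):
    """Decode UTF-8 bytes, strip disallowed characters, re-encode as UTF-8."""
    text = raw_bytes.decode("utf-8")
    cleaned = DISALLOWED.sub(u"", text)
    return cleaned.encode("utf-8")

raw = u"Бланк № свидетельства о допуске.".encode("utf-8")
result = clean_utf8(raw)
print(repr(result))   # bytes that are still valid UTF-8
```

Keeping the text as unicode for all processing and encoding only at the boundary means chardet, files and terminals see well-formed UTF-8 again.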
The alternative would be to use a raw byte string for the regular expression, and matching byte combinations. Working out what UTF-8 bytes and ranges are valid is very, very tricky. You'll need to fully understand how UTF-8 encodes characters to multiple bytes, then translate your negative character class to a set of positive matches that allow through the same byte combinations. This is not for the faint of heart.
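To give a feel for the byte-level approach, here is a rough sketch that covers only a small subset of the original character class (Cyrillic letters plus a few ASCII characters). The byte ranges are worked out by hand from the UTF-8 encoding of U+0401, U+0410 to U+044F and U+0451. The name KEEP and the sample input are illustrative assumptions, and this is not a complete translation of the pattern in the question.

```python
# -*- coding: utf-8 -*-
import re

# Positive list of byte sequences to keep; everything else is dropped.
KEEP = re.compile(
    b"\xd0[\x90-\xbf]"       # U+0410..U+043F: А-Я and а-п
    b"|\xd1[\x80-\x8f]"      # U+0440..U+044F: р-я
    b"|\xd0\x81|\xd1\x91"    # Ё and ё
    b"|[-0-9A-Za-z.,\\s]"    # a few ASCII characters and whitespace
)

raw = u"Бланк № свидетельства о допуске.".encode("utf-8")

# Join the kept sequences; the three bytes of "№" match nothing and vanish.
kept = b"".join(KEEP.findall(raw))
print(repr(kept))
```

Even this reduced version needs four alternatives for one alphabet, which illustrates why decoding to unicode first is usually the easier route.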