I read data from a file and split it into words:
words = re.findall(r'[\w]+',self._from.encode('utf8'),re.U)
If the file contains:
Hi, how are you?
Then the result will be:
['Hi', 'how', 'are', 'you']
But if the file contains Russian text (i.e. Cyrillic characters), for example:
Привет, как дела?
In this case the result is:
['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xba\xd0', '\xd0\xba', '\xd0', '\xd0\xb5\xd0', '\xd0']
Why? I've already added:
sys.setdefaultencoding('utf-8')
I'm using Python 2.7 on Ubuntu Linux.
words = re.findall(r'[\w]+',self._from.decode('utf8'),re.U)
print u" ".join(words)
Python 2 uses the str type to store bytes and the unicode type to store Unicode code points. String literals are str (that is, bytes) by default, and the default encoding for implicit conversions is ASCII. So if the incoming file contains Cyrillic characters, Python 2 may fail or return fragments, because ASCII cannot represent those characters and a UTF-8 byte string holds two bytes for each Cyrillic letter rather than one character.
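The bytes-versus-text difference can be seen with a minimal sketch. The variable names are illustrative only, and the bytes are the UTF-8 encoding of the first word from the question, written with escapes so the snippet should behave the same on Python 2.7 and Python 3:

```python
# UTF-8 bytes of the first word from the question, escaped so that the
# snippet does not depend on the source file encoding.
raw = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

# Decoding turns the bytes into text (unicode in Python 2, str in Python 3).
text = raw.decode('utf-8')

byte_count = len(raw)   # counts bytes: two per Cyrillic letter here
char_count = len(text)  # counts code points: one per letter

print(byte_count)
print(char_count)
```

A regex applied to raw sees twelve separate bytes, while the same regex applied to text sees six letters.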
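\w (lowercase w) matches a single "word" character: a letter, a digit, or an underscore. In Python 2 without flags that is [a-zA-Z0-9_]; with the re.U flag and unicode text it also covers whatever the Unicode character database classifies as alphanumeric. Note that although "word" is the mnemonic for this, it only matches a single word character, not a whole word. \W (uppercase W) matches any non-word character. \b matches the boundary between a word character and a non-word character.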
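As a small illustration of these three escapes, here is a sketch using the ASCII sentence from the question (the second sample string and all variable names are made up for the example):

```python
import re

sample = 'Hi, how are you?'

# \w matches one word character at a time.
single_chars = re.findall(r'\w', sample)

# \w+ groups consecutive word characters into whole words.
whole_words = re.findall(r'\w+', sample)

# \W matches everything that \w does not: punctuation and spaces here.
non_word = re.findall(r'\W', sample)

# \b anchors the match at word boundaries, so the 'are' inside 'aware'
# is not reported.
bounded = re.findall(r'\bare\b', 'are we aware?')

print(whole_words)
print(bounded)
```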
Python has a module named re to work with RegEx. Here's an example:
import re
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
To use \w+ to match alphanumeric unicode characters, you should pass both a unicode pattern and unicode text to re.findall.
In Python2:
Assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a unicode:
uni = 'Привет, как дела?'.decode('utf-8')
ur'(?u)\w+' is a raw unicode literal. Even though it is not necessary here, using raw unicode/string literals for regex patterns is generally a good practice: it allows you to avoid the need for double backslashes before certain characters such as \s.
The regex pattern ur'(?u)\w+' bakes in the Unicode flag, which tells re.findall to make \w dependent on the Unicode character properties database.
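# -*- coding: utf-8 -*-
# The coding line above is needed in Python 2 for non-ASCII source literals.
import re
uni = 'Привет, как дела?'.decode('utf-8')
print(re.findall(ur'(?u)\w+', uni))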
yields a list containing the 3 unicode "words":
[u'\u041f\u0440\u0438\u0432\u0435\u0442',
u'\u043a\u0430\u043a',
u'\u0434\u0435\u043b\u0430']
In Python3:
The general principle is the same, except that what were unicodes in Python2 are now strs in Python3, and there is no longer any attempt at automatic conversion between the two. So, again assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a str, and use a str regex pattern:
import re
uni = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xd0\xba\xd0\xb0\xd0\xba \xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0?'.decode('utf-8')
print(re.findall(r'(?u)\w+', uni))
yields
['Привет', 'как', 'дела']
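If the data really comes from a file, one option is to let the file object do the decoding instead of calling decode by hand. The sketch below assumes the file is UTF-8 encoded; words_from_file is a hypothetical helper name, and the temporary file exists only to make the example self-contained. io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3:

```python
import io
import os
import re
import tempfile


def words_from_file(path, encoding='utf-8'):
    """Hypothetical helper: read a text file and return its words."""
    # io.open decodes while reading, so `text` is already unicode/str.
    with io.open(path, encoding=encoding) as handle:
        text = handle.read()
    # re.UNICODE is required on Python 2 and harmless on Python 3.
    return re.findall(r'\w+', text, re.UNICODE)


# Demo: write the UTF-8 bytes of the question's sentence to a temp file.
raw = (b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, '
       b'\xd0\xba\xd0\xb0\xd0\xba \xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0?')
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as handle:
        handle.write(raw)
    words = words_from_file(path)
finally:
    os.remove(path)

print(len(words))
```

The design choice here is to decode once at the boundary (when reading) and work with text everywhere else, which is the same principle as the explicit decode shown above.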
My solution:
txt = re.findall(r'[А-я]+', data)
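А-я covers the Russian alphabet letters from А (U+0410) to я (U+044F). Note that Ё and ё lie outside this range, and that data has to be decoded text (unicode in Python 2, str in Python 3) for the range to work.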
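The caveat about Ё and ё can be checked with a short sketch. The sample text and variable names are made up for the example, and the Cyrillic characters are written as escapes so the snippet should run unchanged on Python 2.7 and Python 3:

```python
import re

# Three words: one starting with U+0401 (capital Yo), the single letter
# U+0438, and one starting with U+0451 (small yo).
text = u'\u0401\u043b\u043a\u0430 \u0438 \u0451\u0436'

# The range U+0410..U+044F is the same as the class [А-я] above.
narrow = re.findall(u'[\u0410-\u044f]+', text)

# Adding U+0401 and U+0451 explicitly keeps words with Yo intact.
wide = re.findall(u'[\u0410-\u044f\u0401\u0451]+', text)

print(len(narrow[0]))
print(len(wide[0]))
```

With the narrow class the first and last words lose their leading letter; the wider class returns them whole.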