Russian symbols in re (Python)

Tags:

python

regex

utf-8

I read data from a file:

words = re.findall(r'[\w]+',self._from.encode('utf8'),re.U)

If the file contains:

Hi, how are you?

Then the result is:

['Hi', 'how', 'are', 'you']

But if the file contains Russian text (i.e. Cyrillic characters), for example:

Привет, как дела?

In this case the result is:

['\xd0', '\xd1', '\xd0', '\xd0\xb2\xd0\xb5\xd1', '\xd0\xba\xd0', '\xd0\xba', '\xd0', '\xd0\xb5\xd0', '\xd0']

Why does this happen? I've already added:

sys.setdefaultencoding('utf-8')

I'm using Python 2.7 on Ubuntu Linux.

Answer:

words = re.findall(r'[\w]+',self._from.decode('utf8'),re.U)
print u" ".join(words)
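The fragments in the question look like what you get when `\w` is applied to the UTF-8 bytes one at a time, with each byte classified as if it were a Latin-1 character. The sketch below is a reconstruction in Python 3, not the original Python 2 code: it re-reads the encoded bytes as Latin-1 to imitate that behaviour (an assumption about what Python 2 does with `re.U` on a byte string), then shows the decoded text matching whole words.

```python
import re

text = 'Привет, как дела?'

# Imitate matching against raw UTF-8 bytes: each byte becomes one
# Latin-1 character, so \w only accepts bytes that happen to map to a
# letter or digit in Latin-1 (0xD0 is a letter, 0x9F is a control code).
byte_view = text.encode('utf-8').decode('latin-1')
fragments = [m.encode('latin-1') for m in re.findall(r'\w+', byte_view)]
print(fragments)

# Matching against the decoded text treats each Cyrillic letter as one
# character, so whole words come back.
words = re.findall(r'\w+', text)
print(words)
```

Each Cyrillic letter is two bytes in UTF-8, and only some of those bytes count as word characters on their own, which is why the words break apart at seemingly random places.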

asked Mar 16 '13 10:03

Queen johniek

2 Answers

To use \w+ to match alphanumeric Unicode characters, you should pass both a unicode pattern and unicode text to re.findall.

In Python 2:

Assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a unicode object:
```
uni = 'Привет, как дела?'.decode('utf-8')
```
ur'(?u)\w+' is a raw unicode literal. Even though it is not necessary here, using raw unicode/string literals for regex patterns is generally good practice: it avoids the need for double backslashes before escapes such as \b, which a non-raw literal would interpret as a backspace character.

The regex pattern ur'(?u)\w+' bakes in the Unicode flag, which tells re.findall to make \w dependent on the Unicode character properties database.
```
import re
uni = 'Привет, как дела?'.decode('utf-8')
print(re.findall(ur'(?u)\w+', uni))
```
yields a list containing the 3 unicode "words":
```
[u'\u041f\u0440\u0438\u0432\u0435\u0442',
 u'\u043a\u0430\u043a',
 u'\u0434\u0435\u043b\u0430']
```
In Python 3:

The general principle is the same, except that what was the unicode type in Python 2 is now str in Python 3 (see https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit), and there is no longer any attempt at automatic conversion between text (str) and bytes. So, again assuming that you are reading bytes (not text) from the file, you should decode the bytes to obtain a str, and use a str regex pattern:
```
import re
uni = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82, \xd0\xba\xd0\xb0\xd0\xba \xd0\xb4\xd0\xb5\xd0\xbb\xd0\xb0?'.decode('utf-8')
print(re.findall(r'(?u)\w+', uni))
```
yields
```
['Привет', 'как', 'дела']
```
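In Python 3 it is often simpler to let `open` do the decoding by passing `encoding`, so the regex only ever sees `str`. Here is a minimal sketch, assuming the file is UTF-8 encoded. The temporary file and the helper name `find_words` are made up for the illustration. The `(?u)` prefix is left out because Unicode matching is already the default for `str` patterns in Python 3.

```python
import os
import re
import tempfile


def find_words(path, encoding='utf-8'):
    """Return all runs of word characters found in a text file."""
    # Opening in text mode with an explicit encoding yields str, not bytes.
    with open(path, encoding=encoding) as f:
        return re.findall(r'\w+', f.read())


# Write a small UTF-8 sample file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('Привет, как дела?')

try:
    words = find_words(tmp.name)
    print(words)
finally:
    os.remove(tmp.name)
```

If the file is opened in binary mode instead (`'rb'`), the data has to be decoded explicitly before matching, as in the example above.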

answered Sep 17 '22 20:09

unutbu

My solution:

txt = re.findall(r'[А-я]+', data)

А-я covers the letters of the Russian alphabet, except Ё and ё, which fall outside this range.
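A quick check of what the character class covers, written as a Python 3 sketch. The sample string and the variable names are invented for the illustration. The range А-я spans U+0410 to U+044F, while Ё is U+0401 and ё is U+0451, so words containing them get split unless those two letters are listed explicitly.

```python
import re

sample = 'Ёлка и ёж'

# The range alone skips Ё and ё, so the words are cut at those letters.
narrow = re.findall(r'[А-я]+', sample)

# Listing Ё and ё explicitly keeps the words intact.
wide = re.findall(r'[А-яЁё]+', sample)

print(narrow)
print(wide)
```

Compared with `\w+`, an explicit class like this also excludes digits, underscores and letters from other alphabets, which may or may not be what you want.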

answered Sep 19 '22 20:09

Dmitry
