
Python - regex - special characters and ñ

Tags:

python

regex

unicode

I have this script to test a regex and how unicode behaves:

# -*- coding: utf-8 -*-
import re

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

w = re.findall('[a-zA-ZÑñ]+',p.decode('utf-8'), re.UNICODE)

print(w)

And the print statement is showing this:

[u'Solo', u'voy', u'si', u'se', u'sucedier', u'n', u'o', u'se', u'suceden', u'ma', u'ana', u'los', u'siguien', u'es', u'eventos']

"sucedierón" is being transformed to "u'sucedier', u'n'", and similarly "mañana" becomes "u'ma', u'ana'".

I have tried decoding, and adding '\xc3\xb1a' to the regex for 'Ñ'.

Later, after reading some docs, I realized that [a-zA-Z] matches only ASCII characters. That is why I changed to r'\b\w+\b', so I can add flags to the regex:

w = re.findall(r'\b\w+\b', p, re.UNICODE)

But this didn't work.

I also tried to decode() first and findall() later:

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
U = p.decode('utf8')

If I print variable U

"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

I see that the output is as expected, but when I use the findall() again:

[u'Solo', u'voy', u'si', u'se', u'sucedier\xf3n', u'o', u'se', u'suceden', u'ma\xf1ana', u'los', u'siguien\xf1es', u'eventos']

Now the words are complete, but ó is shown as \xf3 and ñ as \xf1, their Unicode escape values.

How can I use findall() and get the non-ASCII characters "ñ", "á", "é", "í", "ó", "ú"?

I know there are a lot of questions of this kind on SO, and believe me, I have read a lot of them, but I just cannot find the missing part.

EDIT

I am using python 2.7

EDIT 2 Can someone else try what @LetzerWille suggests? It is not working for me.

583

asked Sep 30 '15 18:09

NachoMiguel

1 Answer

Regex with accented characters (diacritics) in Python

The re.UNICODE flag allows you to use word characters \w and word boundaries \b with diacritics (accents and tildes). This is extremely useful to match words in different languages.
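
A minimal sketch of the difference, assuming the text is already a unicode string rather than UTF-8 bytes. The sample string and variable names are illustrative and not taken from the question. The u'' prefix keeps it runnable on Python 2.7 and on Python 3.3+, where re.UNICODE is already the default for unicode patterns.

```python
# -*- coding: utf-8 -*-
import re

# Illustrative sample: a unicode string, not UTF-8 bytes.
sample = u"sucedierón mañana"

# An ASCII-only class stops at every accented letter.
ascii_words = re.findall(u"[a-zA-Z]+", sample)
# ascii_words == [u'sucedier', u'n', u'ma', u'ana']

# With re.UNICODE, \w also covers accented letters.
unicode_words = re.findall(u"\\w+", sample, re.UNICODE)
# unicode_words == [u'sucedierón', u'mañana']
```

The first result reproduces the split words from the question; the second keeps each word whole.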

1. Decode your text from UTF-8 to unicode.
2. Make sure the pattern and the subject text are both passed as unicode to the regex functions.
3. The result is a list of unicode strings, which can be looped over or mapped to encode them back to UTF-8.
4. Printing the whole list shows non-ASCII characters escaped (for example \xf1 for ñ), but printing each string on its own displays them correctly.
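
The last point can be checked directly. The sketch below (variable names are illustrative) suggests that \xf1 in the list display is only the escaped spelling of ñ: the string holds the same single character either way. It assumes Python 2.7 or Python 3.3+.

```python
# -*- coding: utf-8 -*-

# The same word written with an escape and with the literal character.
escaped = u"ma\xf1ana"
literal = u"mañana"

same_text = (escaped == literal)   # True: \xf1 is the code point of ñ
length = len(escaped)              # 6 characters, not 9

# Encoding yields UTF-8 bytes, where ñ takes two bytes (0xC3 0xB1).
utf8_bytes = escaped.encode("utf-8")
byte_length = len(utf8_bytes)      # 7
```

So nothing is lost or replaced in the matches; only the way a list displays its items differs from how a single string is printed.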

Code:

# -*- coding: utf-8 -*-
# http://stackoverflow.com/q/32872917/5290909
#python 2.7.9

import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
# Decode to unicode
unicode_text = text.decode('utf8')

matches = re.findall(ur'\b\w+\b', unicode_text, re.UNICODE)

# Encode back again to UTF-8
utf8_matches = [ match.encode('utf-8') for match in matches ]

# Print every word
for utf8_word in utf8_matches:
    print utf8_word
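
One caveat: \w also matches digits and the underscore. If only letters should count as word characters, two possible alternatives are sketched below, under the same assumptions (unicode input, Python 2.7 or 3.3+). The sample text and the contents of the explicit class are illustrative; that class covers Spanish letters only.

```python
# -*- coding: utf-8 -*-
import re

# Illustrative sample mixing letters, digits and an underscore.
sample = u"mañana_2 sucedierón año2015"

# \w+ keeps digits and underscores attached to the word.
with_w = re.findall(u"\\w+", sample, re.UNICODE)
# [u'mañana_2', u'sucedierón', u'año2015']

# [^\W\d_] means "a word character that is neither a digit nor an underscore".
letters_only = re.findall(u"[^\\W\\d_]+", sample, re.UNICODE)
# [u'mañana', u'sucedierón', u'año']

# Explicit class: the pattern must be unicode so that ñ is one character,
# not two UTF-8 bytes.
spanish = re.findall(u"[a-zA-ZñÑáéíóúüÁÉÍÓÚÜ]+", sample)
# [u'mañana', u'sucedierón', u'año']
```

The explicit class is closer to the pattern in the question, but it has to list every accented letter it should accept, which is why the original [a-zA-ZÑñ] still split "sucedierón" at ó.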


182

answered Sep 20 '22 19:09

Mariano
