I have this script to test a regex and see how Unicode behaves:
# -*- coding: utf-8 -*-
import re
p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
w = re.findall('[a-zA-ZÑñ]+',p.decode('utf-8'), re.UNICODE)
print(w)
And the print statement shows this:
[u'Solo', u'voy', u'si', u'se', u'sucedier', u'n', u'o', u'se', u'suceden', u'ma', u'ana', u'los', u'siguien', u'es', u'eventos']
"sucedierón" is being transformed to "u'sucedier', u'n'", and similarly "mañana" becomes "u'ma', u'ana'".
I have tried decoding and adding '\xc3\xb1a' to the regex for 'Ñ'.
Later, after reading some docs, I realized that [a-zA-Z] matches only ASCII characters. That is why I changed to r'\b\w+\b', so that I can add flags to the regex:
w = re.findall(r'\b\w+\b', p, re.UNICODE)
But this didn't work.
I also tried to decode() first and findall() later:
p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
U = p.decode('utf8')
If I print the variable U, I get:
"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
I see that the output is as expected, but when I use findall() again, I get:
[u'Solo', u'voy', u'si', u'se', u'sucedier\xf3n', u'o', u'se', u'suceden', u'ma\xf1ana', u'los', u'siguien\xf1es', u'eventos']
Now the words are complete, but ó is replaced with \xf3 and ñ is replaced with \xf1, their Unicode values.
How can I use findall() and get the non-ASCII characters "ñ", "á", "é", "í", "ó", "ú"?
I know there are a lot of questions like this on SO, and believe me I read a lot of them, but I just cannot find the missing part.
EDIT
I am using Python 2.7
EDIT 2 Can someone else try what @LetzerWille suggests? It is not working for me.
To match a character that has a special meaning in a regex, prefix it with a backslash ( \ ) as an escape sequence. E.g., \. matches "." ; \+ matches "+" ; and \( matches "(" . To match a backslash itself, use \\ .
In other words, a single escape removes the special meaning of a regex symbol. For example, to match the dot or asterisk characters '.' and '*' , you must cancel the special meaning of the regex operators . and * by escaping them as \. and \* .
Use the re.split() method to split a string on all special characters. The re.split() method takes a pattern and a string and splits the string on each occurrence of the pattern.
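As a rough illustration of the escaping and splitting described above, here is a minimal sketch. The sample string and the chosen set of special characters are made up for this example and are not from the original post.

```python
import re

# Hypothetical sample string containing several regex metacharacters.
sample = "a.b*c+d(e)f\\g"

# Outside a character class, a literal dot has to be escaped.
dots = re.findall(r"\.", sample)

# Inside a character class most metacharacters are already literal,
# but the backslash itself still has to be written as \\.
parts = re.split(r"[.*+()\\]", sample)

print(dots)   # ['.']
print(parts)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

If the set of special characters is not known in advance, re.escape() can build the escaped pattern instead of writing the backslashes by hand.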
The re.UNICODE flag allows you to use word characters \w and word boundaries \b with diacritics (accents and tildes). This is extremely useful to match words in different languages.
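A small sketch of what the flag changes, written so that it should run on both Python 2.7 and Python 3. The short sample text is taken from the question; the variable names are only illustrative. On Python 2.7 the flag is what makes \w accept ó and ñ in a unicode string; on Python 3, str patterns already behave this way and the flag is redundant but harmless.

```python
# -*- coding: utf-8 -*-
import re

# A unicode literal, so no decode() step is needed here.
sample = u"sucedierón mañana"

# With re.UNICODE, \w follows the Unicode character database,
# so accented letters and ñ count as word characters.
words = re.findall(r"\b\w+\b", sample, re.UNICODE)

print(len(words))
for word in words:
    print(word)
```

With the flag, the two words come back whole instead of being cut at the accented letters.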
Code:
# -*- coding: utf-8 -*-
# http://stackoverflow.com/q/32872917/5290909
#python 2.7.9
import re
text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
# Decode to unicode
unicode_text = text.decode('utf8')
matches = re.findall(ur'\b\w+\b', unicode_text, re.UNICODE)
# Encode back again to UTF-8
utf8_matches = [ match.encode('utf-8') for match in matches ]
# Print every word
for utf8_word in utf8_matches:
    print utf8_word
ideone Demo
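On the \xf3 and \xf1 in the question's second output: as far as I can tell, those are not replacements. Printing a list in Python 2 shows the repr() of every item, and the repr() of a unicode string escapes non-ASCII characters. The sketch below (illustrative names, shortened sample text, assuming a console that can display UTF-8) checks that the escaped and literal forms are the same string and prints the words directly.

```python
# -*- coding: utf-8 -*-
import re

text = u"Solo voy si se sucedierón o se suceden mañana"
matches = re.findall(r"\w+", text, re.UNICODE)

# u'ma\xf1ana' and u'mañana' are the same six-character string;
# \xf1 is only how repr() writes the code point of ñ.
same = (u"ma\xf1ana" == u"mañana")
print(same)

# Printing the items themselves (here joined into one string)
# shows the characters instead of the escapes.
joined = u" ".join(matches)
print(joined)
```

This is also why the answer's code loops over the matches and prints each word rather than printing the list.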