
Python - regex - special characters and ñ

I have this script to test a regex and how unicode behaves:

# -*- coding: utf-8 -*-
import re

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

w = re.findall('[a-zA-ZÑñ]+',p.decode('utf-8'), re.UNICODE)

print(w)

And the print statement is showing this:

[u'Solo', u'voy', u'si', u'se', u'sucedier', u'n', u'o', u'se', u'suceden', u'ma', u'ana', u'los', u'siguien', u'es', u'eventos']

"sucedierón" is being transformed to "u'sucedier', u'n'", and similarly "mañana" becomes "u'ma', u'ana'".

I have tried decoding, and adding '\xc3\xb1a' to the regex for 'Ñ'.

Later, after reading some docs, I realized that [a-zA-Z] only matches ASCII letters. That is why I changed to r'\b\w+\b', so I can add flags to the regex:

w = re.findall(r'\b\w+\b', p, re.UNICODE) 

But this didn't work.

I also tried to decode() first and findall() later:

p = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
U = p.decode('utf8')

If I print variable U

"Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

I see that the output is as expected, but when I use the findall() again:

[u'Solo', u'voy', u'si', u'se', u'sucedier\xf3n', u'o', u'se', u'suceden', u'ma\xf1ana', u'los', u'siguien\xf1es', u'eventos']

Now the words are complete, but ó is shown as \xf3 and ñ as \xf1, their Unicode escape values.

How can I use findall() and get the non-ASCII characters "ñ", "á", "é", "í", "ó", "ú"?

I know there are a lot of questions like this on SO, and believe me, I read a lot of them, but I just cannot find the missing part.

EDIT

I am using python 2.7

EDIT 2: Can someone else try what @LetzerWille suggests? It is not working for me.

asked Sep 30 '15 18:09 by NachoMiguel




1 Answer

Regex with accented characters (diacritics) in Python

The re.UNICODE flag allows you to use word characters \w and word boundaries \b with diacritics (accents and tildes). This is extremely useful to match words in different languages.
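As a minimal sketch of that difference (the sample text and variable names here are illustrative, not from the question), compare an explicit ASCII range with \w under re.UNICODE on text that is already unicode. The u prefixes are meant to keep it valid on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import re

# Already-decoded (unicode) text; hypothetical sample for illustration.
text = u"se sucedierón mañana"

# An explicit ASCII range stops at every accented letter,
# so words are split around them.
ascii_words = re.findall(u"[a-zA-Z]+", text)
# -> se, sucedier, n, ma, ana

# With re.UNICODE, \w treats accented letters as word characters.
unicode_words = re.findall(u"\\w+", text, re.UNICODE)
# -> se, sucedierón, mañana
```

Note that \w also matches digits and the underscore, so it is broader than "letters only".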

  1. Decode your text from UTF-8 to unicode
  2. Make sure the pattern and the subject text are passed as unicode to the regex functions.
  3. The result is a list of unicode strings that can be looped/mapped to encode back again to UTF-8
  4. Printing the list shows non-ASCII characters escaped (the list displays each string's repr), but it's safe to print each string independently.
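To illustrate points 3 and 4, here is a small sketch (variable names are illustrative) suggesting that the escaped form seen in the printed list and the accented word are the same unicode string, and that encoding to UTF-8 is what turns it into bytes:

```python
# -*- coding: utf-8 -*-

# Escaped spelling, as it appears when the whole list is printed in Python 2.
escaped = u"ma\xf1ana"

# Literal spelling of the same word.
literal = u"mañana"

# Both spellings denote the same six-character unicode string.
is_same = (escaped == literal)
char_count = len(escaped)

# Encoding yields bytes; ñ takes two bytes in UTF-8 (0xC3 0xB1).
utf8_bytes = escaped.encode("utf-8")
byte_count = len(utf8_bytes)
```

So \xf1 in the list output is only a display form of ñ; nothing in the matched word has been replaced.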

Code:

# -*- coding: utf-8 -*-
# http://stackoverflow.com/q/32872917/5290909
#python 2.7.9

import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"
# Decode to unicode
unicode_text = text.decode('utf8')

matches = re.findall(ur'\b\w+\b', unicode_text, re.UNICODE)

# Encode back again to UTF-8
utf8_matches = [ match.encode('utf-8') for match in matches ]

# Print every word
for utf8_word in utf8_matches:
    print utf8_word
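The code above targets Python 2.7. As a hedged sketch for Python 3 (an assumption on my part, not part of the original answer), the decode and encode steps should no longer be needed, because str is already unicode and \w is Unicode-aware by default for str patterns:

```python
import re

text = "Solo voy si se sucedierón o se suceden mañana los siguienñes eventos:"

# No decode step: str is unicode in Python 3, and re.UNICODE is
# implied for str patterns.
matches = re.findall(r"\b\w+\b", text)

# Print every word; no encode step is required before printing.
for word in matches:
    print(word)
```

In Python 3, printing the whole list also shows the accented characters directly instead of escapes.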


answered Sep 20 '22 19:09 by Mariano