Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Problem with accented characters [closed]

I'm having a problem on the accentuation of Python regular expression, I'm doing the following attempt:

import re
ER = re.compile(r'\w', re.L)
print(ER.sub('.','Maçã'))

..çã

Even using the re.compile passing the locale as an argument, the accents are not recognized. Has anyone had this problem?

Thanks!

like image 929
Michel Andrade Avatar asked Jan 24 '11 12:01

Michel Andrade


2 Answers

You're better off using re.U unicode flag.

If using Python 2.x you also need to specify the string as unicode, i.e.

print(ER.sub('.', u'Maçã'))
like image 192
SilentGhost Avatar answered Sep 19 '22 22:09

SilentGhost


From http://www.regular-expressions.info/python.html

By default, Python's regex engine only considers the letters A through Z, the digits 0 through 9, and the underscore as "word characters". Specify the flag re.L or re.LOCALE to make \w match all characters that are considered letters given the current locale settings. Alternatively, you can specify re.U or re.UNICODE to treat all letters from all scripts as word characters. The setting also affects word boundaries.

Try using re.UNICODE.

like image 29
MeanEYE Avatar answered Sep 18 '22 22:09

MeanEYE