Python regex uppercase unicode word

Tags: python, regex

I need to find abbreviations in text written in many languages. My current regex is:

import regex as re
pattern = re.compile(r'(?:[\w]\.)+', re.UNICODE | re.MULTILINE | re.DOTALL | re.VERSION1)
pattern.findall("U.S.A. u.s.a.")

I don't want u.s.a. in the result; I need only the uppercase matches. [A-Z] won't work for any language other than English.

artyomboyko asked Sep 26 '12 01:09


1 Answer

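You need to use a Unicode character property to match uppercase letters in every script. The standard-library re module does not support character properties such as \p{Lu} (uppercase letter), but the third-party regex module does.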

>>> import regex
>>> regex.findall(ur'\p{Lu}', u'ÜìÑ')
[u'\xdc', u'\xd1']
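Applied to the pattern from the question, this could look roughly like the sketch below. It is written for Python 3 and assumes the third-party regex package is installed; the names UPPER_ABBREVIATION and find_upper_abbreviations are made up for illustration and are not part of any library.

```python
import regex  # third-party package (pip install regex), not the stdlib re

# \p{Lu} matches one uppercase letter in any script, so it replaces \w
# in the question's pattern. The name below is hypothetical.
UPPER_ABBREVIATION = regex.compile(r'(?:\p{Lu}\.)+')


def find_upper_abbreviations(text):
    """Return each run of 'uppercase letter followed by a dot' in text."""
    return UPPER_ABBREVIATION.findall(text)


print(find_upper_abbreviations("U.S.A. u.s.a. С.Ш.А. с.ш.а."))
# → ['U.S.A.', 'С.Ш.А.']
```

Note that a mixed-case string such as "u.S.A." would still yield "S.A.", so extra boundary handling may be needed depending on the input.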
Ignacio Vazquez-Abrams answered Sep 28 '22 15:09