I need to find abbreviations in text written in many languages. My current regex is:
import regex as re
pattern = re.compile(r'(?:\w\.)+', re.UNICODE | re.MULTILINE | re.DOTALL | re.VERSION1)
pattern.findall("U.S.A. u.s.a.")
I don't want u.s.a. in the result; I need only the uppercase matches. [A-Z] won't work for any language except English.
You need to use a Unicode character property to match uppercase letters. The standard-library re module does not support character properties such as \p{Lu} (uppercase letter), but the third-party regex module does.
>>> import regex
>>> regex.findall(r'\p{Lu}', 'ÜìÑ')
['Ü', 'Ñ']
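Applied to the pattern from the question, this could look roughly like the sketch below. It assumes Python 3 and the third-party regex package; the names ABBREVIATION and find_upper_abbreviations are made up for illustration and are not part of any library.

```python
import regex  # third-party package (pip install regex), not the stdlib re

# Hypothetical name. One or more "uppercase letter + dot" pairs.
# \p{Lu} is the Unicode "Letter, uppercase" category, so it covers
# Latin, Cyrillic, Greek and other cased scripts, not only A-Z.
ABBREVIATION = regex.compile(r'(?:\p{Lu}\.)+')


def find_upper_abbreviations(text):
    """Return every dotted abbreviation made only of uppercase letters."""
    # The group is non-capturing, so findall returns the whole match.
    return ABBREVIATION.findall(text)


print(find_upper_abbreviations("U.S.A. u.s.a."))    # ['U.S.A.']
print(find_upper_abbreviations("С.Ш.А. и с.ш.а."))  # ['С.Ш.А.']
```

Scripts that have no upper/lower case distinction contain no \p{Lu} characters, so this pattern will not match abbreviations written in them.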