
Matching only a unicode letter in Python re

I have a string from which I want to extract 3 groups:

'19 janvier 2012' -> '19', 'janvier', '2012' 

The month name can contain non-ASCII characters, so [A-Za-z] does not work for me:

>>> import re
>>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 janvier 2012', re.UNICODE).groups()
(u'20', u'janvier', u'2012')
>>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 février 2012', re.UNICODE).groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'

I could use \w, but it also matches digits and underscores:

>>> re.search(ur'(\w+)', u'février', re.UNICODE).groups()
(u'f\xe9vrier',)
>>> re.search(ur'(\w+)', u'fé_q23vrier', re.UNICODE).groups()
(u'f\xe9_q23vrier',)

I tried to use [:alpha:], but it's not working:

>>> re.search(ur'[:alpha:]+', u'février', re.UNICODE).groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'

I would like to match \w without [_0-9], but I don't know how. And even if I find out how to do this, is there a ready-made shortcut like [:alpha:] that works in Python?

warvariuc asked Jan 19 '12 09:01




1 Answer

You can construct a new character class:

[^\W\d_] 

instead of \w. Translated into English, it means "Any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore".

Therefore, it will only allow Unicode letters (if you use the re.UNICODE compile option).
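As a minimal sketch of how this class could be applied to the date string from the question, the snippet below assumes Python 3, where str patterns are Unicode-aware by default, so the re.UNICODE flag and the ur'' prefix are not needed. The names LETTER, DATE_RE and split_date are hypothetical, chosen only for illustration. The day is written as \d{1,2} rather than the question's \d{,2}, on the assumption that an empty day should not match.

```python
import re

# A letter: a word character (\w) that is neither a digit nor an underscore.
LETTER = r"[^\W\d_]"

# Day (1-2 digits), month name (letters only), four-digit year.
# Assumption: \d{1,2} is used instead of the question's \d{,2}.
DATE_RE = re.compile(r"(\d{1,2}) (" + LETTER + r"+) (\d{4})")


def split_date(text):
    """Return (day, month, year) as strings, or None if nothing matches."""
    match = DATE_RE.search(text)
    return match.groups() if match else None


print(split_date("19 janvier 2012"))
print(split_date("20 février 2012"))
print(split_date("20 fé_q23vrier 2012"))  # digits and underscore are rejected
```

Returning None instead of calling .groups() directly avoids the AttributeError shown in the question when the pattern does not match.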

Tim Pietzcker answered Sep 28 '22 14:09
