I have a string from which I want to extract three groups:
'19 janvier 2012' -> '19', 'janvier', '2012'
The month name may contain non-ASCII characters, so [A-Za-z] does not work for me:
>>> import re
>>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 janvier 2012', re.UNICODE).groups()
(u'20', u'janvier', u'2012')
>>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 février 2012', re.UNICODE).groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
I could use \w, but it also matches digits and the underscore:
>>> re.search(ur'(\w+)', u'février', re.UNICODE).groups()
(u'f\xe9vrier',)
>>> re.search(ur'(\w+)', u'fé_q23vrier', re.UNICODE).groups()
(u'f\xe9_q23vrier',)
I tried to use [:alpha:], but it's not working:
>>> re.search(ur'[:alpha:]+', u'février', re.UNICODE).groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
What I need is a way to match \w without [_0-9], but I don't know how to do that. And even if I find out how, is there a ready-made shortcut like [:alpha:] that works in Python?
You can construct a new character class:
[^\W\d_]
and use it instead of \w. Translated into English, it means "any character that is not a non-alphanumeric character ([^\W] is the same as \w), but that is also not a digit and not an underscore".
Therefore, it will match only Unicode letters (if you use the re.UNICODE compile option).
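Applied to the date string from the question, this could look like the sketch below. It assumes Python 3, where str patterns are Unicode-aware by default, so the re.UNICODE flag is not needed; on Python 2 you would keep the ur'...' literals and re.UNICODE as in the question. The names DATE_RE and split_date are made up for illustration, and the day group is written as \d{1,2} rather than \d{,2} so that it cannot match an empty day.

```python
import re

# [^\W\d_] = a word character that is neither a digit nor an underscore,
# i.e. a letter (including accented letters such as "é").
DATE_RE = re.compile(r'(\d{1,2}) ([^\W\d_]+) (\d{4})')


def split_date(text):
    """Return (day, month, year) as strings, or None if text does not match."""
    match = DATE_RE.search(text)
    return match.groups() if match else None


print(split_date('19 janvier 2012'))
print(split_date('20 février 2012'))
print(split_date('20 fé_q23vrier 2012'))  # month contains "_" and digits
```

The first two calls return the three groups; the last one returns None because the month group stops at the underscore and the following space is never found.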