Matching only a unicode letter in Python re

Tags:

python

regex

unicode

character-properties

I have a string from which I want to extract 3 groups:

'19 janvier 2012' -> '19', 'janvier', '2012'

The month name may contain non-ASCII characters, so [A-Za-z] does not work for me:

>>> import re
>>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 janvier 2012', re.UNICODE).groups()
(u'20', u'janvier', u'2012')
>>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 février 2012', re.UNICODE).groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'

I could use \w, but it also matches digits and the underscore:

>>> re.search(ur'(\w+)', u'février', re.UNICODE).groups()
(u'f\xe9vrier',)
>>> re.search(ur'(\w+)', u'fé_q23vrier', re.UNICODE).groups()
(u'f\xe9_q23vrier',)

I tried to use [:alpha:], but it does not work:

>>> re.search(ur'[:alpha:]+', u'février', re.UNICODE).groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'

I would like to match \w without [_0-9], but I don't know how. And even if I find out how to do this, is there a ready-made shortcut like [:alpha:] that works in Python?

988

asked Jan 19 '12 09:01

warvariuc

1 Answer

You can construct a new character class:

[^\W\d_]

instead of \w. Translated into English, it means "any character that is not a non-word character ([^\W] is the same as \w), and that is also not a digit and not an underscore".

Since \w matches letters, digits, and the underscore, removing the last two leaves only letters. The class therefore matches only Unicode letters (if you use the re.UNICODE compile option).
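As a minimal sketch of how this class could be applied to the date string from the question, here is a Python 3 version. It assumes Python 3, where str patterns are Unicode-aware by default, so the ur'' prefix and the re.UNICODE flag are not needed. The names LETTER, DATE_RE, and split_date are made up for this illustration, and the day quantifier is tightened from \d{,2} to \d{1,2} so that at least one digit is required.

```python
import re

# A letter: a word character (\w) that is neither a digit nor an underscore.
# Python 3 str patterns are Unicode-aware by default.
LETTER = r'[^\W\d_]'

# day (1-2 digits), month name (letters only), year (4 digits)
DATE_RE = re.compile(r'(\d{1,2}) (' + LETTER + r'+) (\d{4})')


def split_date(text):
    """Return (day, month, year) as strings, or None if nothing matches."""
    match = DATE_RE.search(text)
    return match.groups() if match else None


print(split_date('19 janvier 2012'))   # ('19', 'janvier', '2012')
print(split_date('20 février 2012'))   # ('20', 'février', '2012')

# Unlike \w+, the class stops at digits and underscores:
print(re.findall(LETTER + '+', 'fé_q23vrier'))   # ['fé', 'q', 'vrier']
```

Returning None instead of calling .groups() directly avoids the AttributeError shown in the question when the input does not match.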

111

answered Sep 28 '22 14:09

Tim Pietzcker
