Python regex \w doesn't match combining diacritics?

Tags: python, regex, unicode, diacritics, unicode-normalization

I have a UTF-8 string with combining diacritics, and I want to match it with the \w regex sequence. \w matches precomposed accented characters, but not a Latin letter followed by a combining diacritic.

>>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> print u"ao\u00F3oz"
aoóoz
>>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
>>> print u"aoo\u0301oz"
aoóoz

(The SO markdown processor seems to have trouble rendering the combining diacritic above, but there is a combining ́ after the second o on the last line.)

Is there any way to match combining diacritics with \w? I don't want to normalise the text, because it comes from filenames and I don't want to have to do a whole 'file name Unicode normalization' yet. This is Python 2.5.


asked Jun 29 '10 13:06

Amandasaurus

1 Answer

I've just noticed a new "regex" package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re package.)

Among other things, it seems to have better Unicode support. For example, it supports \X, which matches a single grapheme (whether or not it uses combining marks). It also supports matching on Unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).
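To make the \P{M}\p{M}* idea concrete, here is a minimal sketch that uses only the standard library's unicodedata module. It assumes Python 3, the helper names is_mark and clusters are made up for illustration, and it is only a rough approximation of what \X does in a real regex engine.

```python
import unicodedata


def is_mark(ch):
    """Return True if ch is in a Unicode Mark category (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def clusters(text):
    """Group each non-mark character with the combining marks that follow it."""
    out = []
    for ch in text:
        if is_mark(ch) and out:
            out[-1] += ch   # attach the mark to the preceding base character
        else:
            out.append(ch)  # start a new cluster
    return out


# 'o' followed by U+0301 COMBINING ACUTE ACCENT ends up in one cluster.
parts = clusters("aoo\u0301oz")
print(len(parts))
```

With the decomposed string from the question, the third cluster holds both the o and the combining accent, which is why counting clusters gives the number of letters a reader would see.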

Note that this makes \X more or less the Unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.
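As a rough sketch, the pattern could be used like this. It assumes Python 3 and that the third-party regex package is installed; the names DECOMPOSED, match_letters and graphemes are hypothetical. The import is guarded so the snippet still loads when the package is missing.

```python
# Sketch only: relies on the third-party "regex" package (pip install regex).
try:
    import regex
except ImportError:
    regex = None  # package not installed; the helpers below then return None

# 'o' followed by U+0301 COMBINING ACUTE ACCENT, as in the question.
DECOMPOSED = "aoo\u0301oz"


def match_letters(text):
    """Match 'a', then three letters with optional combining marks, then 'z'."""
    if regex is None:
        return None
    return regex.match(r"a(?:\w\p{M}*){3}z", text)


def graphemes(text):
    """Split text into graphemes with \\X."""
    if regex is None:
        return None
    return regex.findall(r"\X", text)


if regex is not None:
    print(match_letters(DECOMPOSED) is not None)
    print(len(graphemes(DECOMPOSED)))
```

Grouping \w\p{M}* as a unit and repeating it keeps the count at "three letters" however many marks each letter carries.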

It is (for now) a non-stdlib package, and I don't know how ready it is (it also doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest and most "correct" answer to your question. (Otherwise, I think you're down to explicitly using character ranges, as described in my comment to the previous answer.)
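For what it's worth, the character-range fallback might look roughly like this with the stdlib re module. It is a sketch that assumes Python 3, and the list of ranges (the Combining Diacritical Marks blocks) is my own non-exhaustive choice, so marks from other blocks would not be covered.

```python
import re

# Assumed, non-exhaustive ranges of combining marks (Unicode block ranges):
#   U+0300-036F  Combining Diacritical Marks
#   U+1AB0-1AFF  Combining Diacritical Marks Extended
#   U+1DC0-1DFF  Combining Diacritical Marks Supplement
#   U+20D0-20FF  Combining Diacritical Marks for Symbols
#   U+FE20-FE2F  Combining Half Marks
COMBINING = "\u0300-\u036f\u1ab0-\u1aff\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f"

# One "letter": a word character followed by any number of combining marks.
LETTER = r"\w[" + COMBINING + r"]*"

# 'a', three letters, 'z' (the pattern from the question, made mark-aware).
PATTERN = re.compile("a(?:" + LETTER + "){3}z")

for sample in ("aoooz", "ao\u00f3oz", "aoo\u0301oz"):
    print(ascii(sample), PATTERN.match(sample) is not None)
```

The plain a\w\w\wz pattern still fails on the decomposed string, because the stdlib \w does not treat U+0301 as a word character.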

See also this page on Unicode regular expressions, which may be useful to you (and can serve as documentation for some of the features implemented in the regex package).


answered Sep 19 '22 13:09

Steven
