
Matching Unicode characters in Python regular expressions

I have read through the other questions on Stack Overflow, but I am still no closer. Sorry if this has already been answered, but I couldn't get anything proposed there to work.

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try something with Norwegian characters in it (or something more Unicode-like):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

How can I match typical Unicode characters, like øæå? I'd like to be able to match those characters as well, in both the tag group above and the one for the filename.

Asked by Weholt, Feb 17 '11 at 12:02




1 Answer

You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

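This is in Python 2. In Python 3 all strings are Unicode, so the u prefix is unnecessary (it was a syntax error in Python 3.0 to 3.2 and is accepted again as a no-op from 3.3 onward). The re.UNICODE flag is also redundant there, because \w already matches Unicode word characters for str patterns.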
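For readers on Python 3, here is a minimal sketch of the same match. The regular expression is the one from the question; the names `PATTERN`, `ASCII_PATTERN` and `parse_path` are hypothetical and only serve the illustration. It assumes the input uses precomposed characters such as å and ø (combining marks are not matched by \w).

```python
import re

# Pattern copied from the question. In Python 3, \w matches Unicode word
# characters by default for str patterns, so no flag and no u prefix are needed.
PATTERN = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')

# Same pattern with re.ASCII, which restricts \w to [a-zA-Z0-9_] and so
# reproduces the failure described in the question.
ASCII_PATTERN = re.compile(PATTERN.pattern, re.ASCII)


def parse_path(path):
    """Return the named groups as a dict, or None if the path does not match."""
    m = PATTERN.match(path)
    return m.groupdict() if m else None


if __name__ == '__main__':
    print(parse_path('/by_tag/påske/øyfjell.jpg'))
    print(ASCII_PATTERN.match('/by_tag/påske/øyfjell.jpg'))
```

Checking the match object before calling groupdict() also avoids the AttributeError from the question when a path does not match.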

Answered by Thomas, Sep 30 '22 at 16:09