What is the proper regular expression to match all utf-8/unicode lowercase letter forms

I would like to match all lowercase letter forms in the Latin block. The trivial '[a-z]' only matches characters between U+0061 and U+007A, and not all the other lowercase forms.

I would like to match all lowercase letters, most importantly, all the accented lowercase letters in the Latin block used in EFIGS languages.

[a-zà-ý] is a start, but there are still tons of other lowercase characters (see http://www.unicode.org/charts/PDF/U0000.pdf). Is there a recommended way of doing this?
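A quick Python 3 sketch shows where the hand-written range falls short. The sample characters are chosen only for illustration; the range à-ý covers U+00E0 through U+00FD.

```python
import re

# The hand-written class from above, spelled with escapes: à is U+00E0, ý is U+00FD.
naive = re.compile("[a-z\u00e0-\u00fd]")

# U+00F7 DIVISION SIGN lies inside the à-ý range, so it matches although it is not a letter.
print(bool(naive.fullmatch("\u00f7")))  # True

# U+00FF (ÿ) and U+0151 (ő) are lowercase letters that fall outside the range.
print(bool(naive.fullmatch("\u00ff")))  # False
print(bool(naive.fullmatch("\u0151")))  # False
```

So the range both over-matches (a math symbol) and under-matches (letters beyond U+00FD).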

FYI I'm using Python, but I suspect that this problem is cross-language.

Python's builtin "islower()" method seems to do the right checking:

    lower = ''
    for c in xrange(0, 2**16):
        if unichr(c).islower():
            lower += unichr(c)
    print lower
slacy asked Mar 07 '11 20:03



2 Answers

Python's built-in re module does not currently support Unicode properties in regular expressions. See this answer for a link to the Ponyguruma library, which does support them.

Using such a library, you could use \p{Ll} to match any lowercase letter in a Unicode string.

Every character in the Unicode standard is in exactly one category. \p{Ll} is the category of lowercase letters, while \p{L} comprises all the characters in one of the "Letter" categories (Letter, uppercase; Letter, lowercase; Letter, titlecase; Letter, modifier; and Letter, other). For more information see the Character Properties chapter of the Unicode Standard. Or see this page for a good explanation on use of Unicode in regular expressions.
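Without any extra library, the general category of a character can be inspected in Python 3 through the standard unicodedata module. The following is only a sketch; the helper name is_lowercase_letter and the sample characters are made up for illustration.

```python
import unicodedata

def is_lowercase_letter(ch):
    """Return True if the single character ch has general category Ll."""
    return unicodedata.category(ch) == "Ll"

# A few sample characters: two accented Latin letters, an uppercase letter,
# a titlecase letter (U+01C5) and a digit.
for ch in ["a", "\u00e9", "\u00df", "A", "\u01c5", "1"]:
    print(repr(ch), unicodedata.category(ch), is_lowercase_letter(ch))
```

Each character reports exactly one two-letter category, for example Ll for é and ß, Lu for A, and Lt for the titlecase letter U+01C5.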

Avi answered Oct 01 '22 13:10


It looks as though this recipe, posted back in 2005,

    import sys, re

    uppers = [u'[']
    for i in xrange(sys.maxunicode):
        c = unichr(i)
        if c.isupper():
            uppers.append(c)
    uppers.append(u']')

    uppers = u"".join(uppers)
    uppers_re = re.compile(uppers)

    print uppers_re.match('A')

is still relevant.
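On Python 3, where xrange and unichr no longer exist, the same idea might be ported roughly as follows. This is a sketch rather than the original recipe: it collects lowercase characters (as the question asks) instead of uppercase ones, and the function name build_lowercase_class is hypothetical.

```python
import re
import sys

def build_lowercase_class(limit=sys.maxunicode + 1):
    """Build a regex character class containing every code point below
    limit for which str.islower() is true."""
    chars = [chr(cp) for cp in range(limit) if chr(cp).islower()]
    # re.escape is defensive; it keeps the class valid even if a
    # collected character were special inside brackets.
    return "[" + "".join(re.escape(c) for c in chars) + "]"

lowers_re = re.compile(build_lowercase_class())

print(bool(lowers_re.fullmatch("\u00e9")))  # é
print(bool(lowers_re.fullmatch("A")))
```

Scanning the whole code point range takes a moment, so it is probably worth compiling the pattern once and reusing it.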

Antony Hatchkins answered Oct 01 '22 13:10