What is the proper regular expression to match all UTF-8/Unicode lowercase letter forms?

I would like to match all lowercase letter forms in the Latin block. The trivial '[a-z]' only matches characters between U+0061 and U+007A, and not all the other lowercase forms.

I would like to match all lowercase letters, most importantly, all the accented lowercase letters in the Latin block used in EFIGS languages.

[a-zà-ý] is a start, but there are still tons of other lowercase characters (see http://www.unicode.org/charts/PDF/U0000.pdf). Is there a recommended way of doing this?

FYI I'm using Python, but I suspect that this problem is cross-language.

Python's builtin "islower()" method seems to do the right checking:

    lower = ''
    for c in xrange(0, 2**16):
        if unichr(c).islower():
            lower += unichr(c)
    print lower

asked Mar 07 '11 20:03

slacy

2 Answers

Python's built-in re module does not currently support Unicode properties in regular expressions. See this answer for a link to the Ponyguruma library, which does support them.

Using such a library, you could use \p{Ll} to match any lowercase letter in a Unicode string.

Every character in the Unicode standard is in exactly one general category. \p{Ll} is the category of lowercase letters, while \p{L} comprises all the characters in any of the "Letter" categories (Letter, uppercase; Letter, lowercase; Letter, titlecase; Letter, modifier; and Letter, other). For more information, see the Character Properties chapter of the Unicode Standard, or see this page for a good explanation of how to use Unicode in regular expressions.
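Where \p{Ll} is not available, the same category can be approximated with the standard library alone. The following is a minimal Python 3 sketch, not part of the original answer: it asks unicodedata for each code point's general category and collects the "Ll" ones into a character class. The names lowercase_letter_class and LOWER_RE are made up for this illustration, and the exact set of characters depends on the Unicode database version bundled with the interpreter.

```python
import re
import sys
import unicodedata


def lowercase_letter_class():
    """Build a regex character class from every code point in category Ll.

    Hypothetical helper for illustration; it stands in for \\p{Ll}.
    """
    chars = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Ll"
    ]
    # re.escape is a precaution; letters are not regex metacharacters.
    return "[" + "".join(re.escape(c) for c in chars) + "]"


# Compile once and reuse: scanning all code points is the slow part.
LOWER_RE = re.compile(lowercase_letter_class())

# Lowercase letters in "Été à Paris", written with escapes to avoid
# any dependence on the source file encoding.
found = LOWER_RE.findall("\u00c9t\u00e9 \u00e0 Paris")
```

Note that this tests the general category only, so its results may differ slightly from str.islower(), which is the check used in the question.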

answered Oct 01 '22 13:10

Avi

It looks as though this recipe, posted back in 2005,

    import sys, re

    # Collect every uppercase character into one regex character class (Python 2).
    uppers = [u'[']
    for i in xrange(sys.maxunicode):
        c = unichr(i)
        if c.isupper():
            uppers.append(c)
    uppers.append(u']')

    uppers = u"".join(uppers)
    uppers_re = re.compile(uppers)

    print uppers_re.match('A')

is still relevant.
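The recipe targets uppercase letters and Python 2, while the question asks for lowercase letters in the Latin blocks. Below is a hedged Python 3 adaptation of the same idea. The cutoff at U+024F (the end of Latin Extended-B) and the names LATIN_END, latin_lowercase_pattern and latin_lower_re are assumptions chosen for this sketch, not something the recipe specifies.

```python
import re

# Assumption: "the Latin block" means Basic Latin through Latin Extended-B,
# i.e. code points U+0000 to U+024F.
LATIN_END = 0x024F


def latin_lowercase_pattern(limit=LATIN_END):
    """Compile a character class of code points up to `limit` that islower()."""
    chars = [chr(cp) for cp in range(limit + 1) if chr(cp).islower()]
    # re.escape guards against metacharacters should the predicate ever change.
    return re.compile("[" + "".join(re.escape(c) for c in chars) + "]")


latin_lower_re = latin_lowercase_pattern()

# Keep only the lowercase Latin letters of "Garçon".
kept = "".join(latin_lower_re.findall("Gar\u00e7on"))
```

Restricting the range keeps the character class small and leaves out lowercase letters from other scripts, such as Greek or Cyrillic. Raise the limit if those should match as well.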

answered Oct 01 '22 13:10

Antony Hatchkins
