I would like to match all lowercase letter forms in the Latin block. The trivial '[a-z]' only matches characters between U+0061 and U+007A, and not all the other lowercase forms.
I would like to match all lowercase letters, most importantly, all the accented lowercase letters in the Latin block used in EFIGS languages.
[a-zà-ý] is a start, but there are still tons of other lowercase characters (see http://www.unicode.org/charts/PDF/U0000.pdf). Is there a recommended way of doing this?
FYI I'm using Python, but I suspect that this problem is cross-language.
Python's builtin "islower()" method seems to do the right checking:
    lower = ''
    for c in xrange(0, 2**16):
        if unichr(c).islower():
            lower += unichr(c)
    print lower
To match a specific Unicode code point, use \uFFFF, where FFFF is the hexadecimal number of the code point you want to match. You must always specify 4 hexadecimal digits. For example, \u00E0 matches à, but only when it is encoded as the single code point U+00E0, not as a plain a followed by the combining grave accent U+0300.
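The single-code-point caveat can be checked directly. The following is a minimal Python 3 sketch, not part of the original answer; the variable names are only illustrative, and it assumes that normalizing the input to NFC is acceptable for your data.

```python
import re
import unicodedata

precomposed = "\u00e0"   # à stored as the single code point U+00E0
decomposed = "a\u0300"   # a followed by COMBINING GRAVE ACCENT (U+0300)

pattern = re.compile("\u00e0")

# The pattern only matches the precomposed form.
matches_precomposed = pattern.fullmatch(precomposed) is not None
matches_decomposed = pattern.fullmatch(decomposed) is not None

# NFC normalization composes a + U+0300 into U+00E0, so the match succeeds.
normalized = unicodedata.normalize("NFC", decomposed)
matches_normalized = pattern.fullmatch(normalized) is not None

print(matches_precomposed, matches_decomposed, matches_normalized)
# → True False True
```

Normalizing text before matching is one way to make code-point-based patterns behave consistently on accented input.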
?= is a positive lookahead, a type of zero-width assertion. What it's saying is that the match must be followed by whatever is within the parentheses, but that part isn't included in the match. Your example means the match needs to be followed by zero or more characters and then a digit (but again, that part isn't included in the match).
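Since the example being discussed is not shown here, the pattern below is an assumed stand-in: a run of ASCII lowercase letters that must be followed, somewhere later in the string, by a digit. It is a small Python 3 sketch of how the lookahead checks the text ahead without consuming it.

```python
import re

# Hypothetical pattern: lowercase letters, then a lookahead requiring
# "zero or more characters and then a digit" after the match.
pattern = re.compile(r"[a-z]+(?=.*\d)")

with_digit = pattern.match("abc-7")
without_digit = pattern.match("abc-x")

# The lookahead text ("-7") is checked but is not part of the match.
print(with_digit.group(), with_digit.end())   # → abc 3
print(without_digit)                          # → None
```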
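As mentioned in other answers, JavaScript regexes had no support for Unicode character classes when this was written. Since ECMAScript 2018, \p{...} property escapes are available when the u flag is set.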
Python's built-in re module does not currently support Unicode properties in regular expressions. See this answer for a link to the Ponyguruma library, which does support them.
Using such a library, you could use \p{Ll} to match any lowercase letter in a Unicode string.
Every character in the Unicode standard is in exactly one category. \p{Ll} is the category of lowercase letters, while \p{L} comprises all the characters in one of the "Letter" categories (Letter, uppercase; Letter, lowercase; Letter, titlecase; Letter, modifier; and Letter, other). For more information see the Character Properties chapter of the Unicode Standard. Or see this page for a good explanation on use of Unicode in regular expressions.
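Without a regex library that understands \p{...}, the same categories can be inspected with the standard unicodedata module. The sketch below (Python 3) uses two helper functions, is_lowercase_letter and is_letter, which are hypothetical names introduced here to mirror \p{Ll} and \p{L}.

```python
import unicodedata

def is_lowercase_letter(ch):
    """Rough equivalent of \\p{Ll}: general category 'Letter, lowercase'."""
    return unicodedata.category(ch) == "Ll"

def is_letter(ch):
    """Rough equivalent of \\p{L}: any category starting with 'L'."""
    return unicodedata.category(ch).startswith("L")

# One sample per Letter category, plus a digit for contrast.
samples = ["a", "\u00e0", "A", "\u01c5", "\u02b0", "\u05d0", "1"]
for ch in samples:
    print("U+%04X" % ord(ch), unicodedata.category(ch),
          is_lowercase_letter(ch), is_letter(ch))
```

The samples cover Ll (a, à), Lu (A), Lt (U+01C5), Lm (U+02B0), Lo (U+05D0) and the non-letter category Nd (1).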
It looks as though this recipe, posted back in 2005, is still relevant:
    import sys, re
    uppers = [u'[']
    for i in xrange(sys.maxunicode):
        c = unichr(i)
        if c.isupper():
            uppers.append(c)
    uppers.append(u']')
    uppers = u"".join(uppers)
    uppers_re = re.compile(uppers)
    print uppers_re.match('A')
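The recipe is written for Python 2 (xrange, unichr) and collects uppercase letters. A Python 3 variant for lowercase letters might look like the sketch below. The function name lowercase_class and the default limit of U+024F (the end of Latin Extended-B) are assumptions made here to fit the question's focus on Latin letters, and the sketch uses the Ll category rather than islower(). Consecutive code points are folded into ranges to keep the character class short.

```python
import re
import unicodedata

def lowercase_class(limit=0x024F):
    """Build a regex character class of all Ll code points up to `limit`.

    `limit` is an assumed cut-off (end of Latin Extended-B); raise it to
    sys.maxunicode to cover every script.
    """
    parts = []
    run_start = None
    # Iterate one step past `limit` so the final run is always flushed.
    for cp in range(limit + 2):
        in_class = cp <= limit and unicodedata.category(chr(cp)) == "Ll"
        if in_class and run_start is None:
            run_start = cp
        elif not in_class and run_start is not None:
            first, last = chr(run_start), chr(cp - 1)
            # Ll characters are never regex metacharacters, so no escaping.
            parts.append(first if first == last else first + "-" + last)
            run_start = None
    return "[" + "".join(parts) + "]"

lower_re = re.compile(lowercase_class())
words = re.findall(lowercase_class() + "+", "\u00c9t\u00e9 \u00e0 Z\u00fcrich")
print(words)
```

Here the input is "Été à Zürich" in precomposed form, so the uppercase É and Z are skipped and only the lowercase runs are returned.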