
Matching case sensitive unicode strings with regular expressions in Python

Suppose I want to match a lowercase letter followed by an uppercase letter. I could do something like

re.compile(r"[a-z][A-Z]")

Now I want to do the same thing for Unicode strings, i.e. match something like 'aÅ' or 'yÜ'.

I tried

re.compile(r"[a-z][A-Z]", re.UNICODE)

but that does not work.

Any clues?

asked Sep 13 '11 06:09 by repoman



1 Answer

This is hard to do with Python's re module because the current implementation doesn't support Unicode property shortcuts like \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter).

[A-Za-z] will of course only match ASCII letters, whether or not the re.UNICODE flag is set. The flag changes what shorthand classes like \w match, not what an explicit range such as [a-z] covers.
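A quick way to see this for yourself. This is only a sketch and assumes Python 3, where str patterns are already Unicode-aware and re.UNICODE is effectively the default:

```python
import re

# Same pattern as in the question; the flag is redundant on Python 3 str patterns.
pattern = re.compile(r"[a-z][A-Z]", re.UNICODE)

ascii_hit = pattern.search("aB")    # both letters are inside the ASCII ranges
a_ring = pattern.search("aÅ")       # 'Å' is U+00C5, outside A-Z
u_umlaut = pattern.search("yÜ")     # 'Ü' is U+00DC, outside A-Z

print(ascii_hit is not None)  # True
print(a_ring)                 # None
print(u_umlaut)               # None
```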

So until the re module is updated (or you install the regex package currently in development), you need to either do it programmatically (iterate through the string and call char.islower()/char.isupper() on the characters) or specify all the Unicode code points manually, which probably isn't worth the effort.
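Here is a rough sketch of both workarounds using only the standard library, assuming Python 3. The names find_lower_upper, category_class and LOWER_UPPER are made up for illustration. The second part doesn't list code points by hand but generates the character classes from unicodedata, which takes a moment at import time because it scans every code point. Note the two approaches are not guaranteed to agree on every character: str.islower()/str.isupper() use Python's own case rules, while the generated classes use the Ll and Lu general categories.

```python
import re
import sys
import unicodedata


def find_lower_upper(text):
    """Return (index, pair) for every lowercase char directly followed by an uppercase char."""
    matches = []
    for i in range(len(text) - 1):
        if text[i].islower() and text[i + 1].isupper():
            matches.append((i, text[i:i + 2]))
    return matches


def category_class(category):
    """Build a regex character class holding every code point in one general category."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    return "[" + "".join(c for c in chars if unicodedata.category(c) == category) + "]"


# Stand-in for \p{Ll}\p{Lu}, which the re module does not understand.
LOWER_UPPER = re.compile(category_class("Ll") + category_class("Lu"))

print(find_lower_upper("aÅ xyÜ AB ab"))     # [(0, 'aÅ'), (4, 'yÜ')]
print(LOWER_UPPER.findall("aÅ xyÜ AB ab"))  # ['aÅ', 'yÜ']
```

The loop version is the simpler of the two if you only need to find the pairs. The generated classes are more useful when the lowercase/uppercase test is one piece of a larger pattern.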

answered Oct 19 '22 08:10 by Tim Pietzcker