Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: How to check if a unicode string contains a cased character?

I'm doing a filter wherein I check if a unicode (utf-8 encoding) string contains no uppercase characters (in all languages). It's fine with me if the string doesn't contain any cased character at all.

For example: 'Hello!' will not pass the filter, but "!" should pass the filter, since "!" is not a cased character.

I planned to use the islower() method, but in the example above, "!".islower() will return False.

According to the Python Docs, "The python unicode method islower() returns True if the unicode string's cased characters are all lowercase and the string contained at least one cased character, otherwise, it returns False."

Since the method also returns False when the string doesn't contain any cased character, ie. "!", I want to do check if the string contains any cased character at all.

Something like this....

string = unicode("!@#$%^", 'utf-8')

#check first if it contains cased characters
if not contains_cased(string):
     return True

return string.islower():

Any suggestions for a contains_cased() function?

Or probably a different implementation approach?

Thanks!

like image 966
Albert Avatar asked Aug 18 '10 02:08

Albert


People also ask

How do you check if a character is uppercase or lowercase in Python?

To check if a character is upper-case, we can simply use isupper() function call on the said character.

How do you check if all characters in string are capitalized Python?

In Python, isupper() is a built-in method used for string handling. This method returns True if all characters in the string are uppercase, otherwise, returns “False”.

How do I find non ascii characters in a string Python?

isascii() will check if the strings is ascii. "\x03". isascii() is also True. The documentation says this just checks that all characters are below code point 128 (0-127).

How do I check if a string is Unicode?

The isLetter(int codePoint) method determines whether the specific character (Unicode codePoint) is a letter. It returns a boolean value, either true or false. Here, the parameter codePoint represents the character to be checked. The charAt() method returns a character value at a given index.


2 Answers

Here is the full scoop on Unicode character categories.

Letter categories include:

Ll -- lowercase
Lu -- uppercase
Lt -- titlecase
Lm -- modifier
Lo -- other

Note that Ll <-> islower(); similarly for Lu; (Lu or Lt) <-> istitle()

You may wish to read the complicated discussion on casing, which includes some discussion of Lm letters.

Blindly treating all "letters" as cased is demonstrably wrong. The Lo category includes 45301 codepoints in the BMP (counted using Python 2.6). A large chunk of these would be Hangul Syllables, CJK Ideographs, and other East Asian characters -- very hard to understand how they might be considered "cased".

You might like to consider an alternative definition, based on the (unspecified) behaviour of "cased characters" that you expect. Here's a simple first attempt:

>>> cased = lambda c: c.upper() != c or c.lower() != c
>>> sum(cased(unichr(i)) for i in xrange(65536))
1970
>>>

Interestingly there are 1216 x Ll and 937 x Lu, a total of 2153 ... scope for further investigation of what Ll and Lu really mean.

like image 115
John Machin Avatar answered Oct 26 '22 01:10

John Machin


import unicodedata as ud

def contains_cased(u):
  return any(ud.category(c)[0] == 'L' for c in u)
like image 43
Alex Martelli Avatar answered Oct 25 '22 23:10

Alex Martelli