I'm writing a filter that checks whether a Unicode string (UTF-8 encoded) contains no uppercase characters, in any language. It's fine with me if the string doesn't contain any cased characters at all.
For example: 'Hello!' will not pass the filter, but "!" should pass the filter, since "!" is not a cased character.
I planned to use the islower() method, but in the example above, "!".islower() will return False.
According to the Python Docs, "The python unicode method islower() returns True if the unicode string's cased characters are all lowercase and the string contained at least one cased character, otherwise, it returns False."
Since the method also returns False when the string doesn't contain any cased character, e.g. "!", I want to check first whether the string contains any cased character at all.
Something like this:
def passes_filter(string):
    # check first if it contains cased characters
    if not contains_cased(string):
        return True
    return string.islower()

string = unicode("!@#$%^", 'utf-8')
passes_filter(string)
Any suggestions for a contains_cased() function?
Or probably a different implementation approach?
Thanks!
To check whether a character is uppercase, call isupper() on that character.
In Python, isupper() is a built-in string method. It returns True if all cased characters in the string are uppercase and the string contains at least one cased character; otherwise it returns False.
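A quick sketch of that behaviour, written for Python 3 (an assumption; the question itself uses Python 2's unicode type). The helper name has_uppercase is made up for illustration. It tests each character on its own, so uncased characters such as "!" never count as uppercase.

```python
# isupper() on a whole string needs at least one cased character.
print("HELLO!".isupper())   # True
print("Hello!".isupper())   # False
print("!".isupper())        # False


def has_uppercase(text):
    """Hypothetical helper: True if any single character is uppercase."""
    return any(ch.isupper() for ch in text)


print(has_uppercase("Hello!"))  # True
print(has_uppercase("!"))       # False
```

With a helper like this, the filter in the question could be written as not has_uppercase(string), though whether titlecase characters should also be rejected is a separate decision.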
isascii() checks whether the string is ASCII. Note that "\x03".isascii() is also True: the documentation says the method only checks that all characters are below code point 128 (0-127).
In Java, the Character.isLetter(int codePoint) method determines whether the given character (Unicode code point) is a letter and returns a boolean. The codePoint parameter is the character to check. The String.charAt() method returns the char value at a given index.
Here is the full scoop on Unicode character categories.
Letter categories include:
Ll -- lowercase
Lu -- uppercase
Lt -- titlecase
Lm -- modifier
Lo -- other
Note that Ll <-> islower(); similarly Lu <-> isupper(); (Lu or Lt) <-> istitle()
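To see these categories from Python, here is a small sketch using the standard unicodedata module (Python 3 assumed). The helper name categories and the sample characters are my own choices for illustration.

```python
import unicodedata


def categories(text):
    """Return the two-letter general category of each character."""
    return [unicodedata.category(ch) for ch in text]


# a, A, U+01C5 (a titlecase digraph), U+02B0 (a modifier letter),
# U+4E2D (a CJK ideograph), and "!" (punctuation, not a letter)
print(categories("aA\u01c5\u02b0\u4e2d!"))
# ['Ll', 'Lu', 'Lt', 'Lm', 'Lo', 'Po']
```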
You may wish to read the complicated discussion on casing, which includes some discussion of Lm letters.
Blindly treating all "letters" as cased is demonstrably wrong. The Lo category includes 45301 codepoints in the BMP (counted using Python 2.6). A large chunk of these would be Hangul Syllables, CJK Ideographs, and other East Asian characters -- very hard to understand how they might be considered "cased".
You might like to consider an alternative definition, based on the (unspecified) behaviour of "cased characters" that you expect. Here's a simple first attempt:
>>> cased = lambda c: c.upper() != c or c.lower() != c
>>> sum(cased(unichr(i)) for i in xrange(65536))
1970
>>>
Interestingly there are 1216 x Ll and 937 x Lu, a total of 2153 ... scope for further investigation of what Ll and Lu really mean.
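For what it's worth, here is how that definition might be plugged into the filter from the question. This is only a sketch, ported to Python 3 (str instead of unicode); the names is_cased and passes_filter are invented for illustration.

```python
def is_cased(ch):
    """A character counts as cased if upper() or lower() changes it."""
    return ch.upper() != ch or ch.lower() != ch


def contains_cased(text):
    """True if at least one character in text is cased by the test above."""
    return any(is_cased(ch) for ch in text)


def passes_filter(text):
    """Hypothetical filter: accept text with no cased characters,
    or text whose cased characters are all lowercase."""
    if not contains_cased(text):
        return True
    return text.islower()


for sample in ("!@#$%^", "Hello!", "hello!", "\u4e2d\u6587"):
    print(repr(sample), passes_filter(sample))
```

Under this definition the CJK sample has no cased characters, so it passes along with "!@#$%^" and "hello!", while "Hello!" is rejected.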
import unicodedata as ud

def contains_cased(u):
    return any(ud.category(c)[0] == 'L' for c in u)
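Keep in mind that this check accepts every letter category, including Lo and Lm, so caseless scripts count as "cased". A possible narrower variant, assuming only Lu, Ll and Lt should count, is sketched below (Python 3; the names contains_letter, contains_cased_narrow and CASED_CATEGORIES are hypothetical).

```python
import unicodedata as ud

# Assumption: only these general categories are treated as cased.
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def contains_letter(u):
    """Same test as above: any character whose category starts with 'L'."""
    return any(ud.category(c)[0] == "L" for c in u)


def contains_cased_narrow(u):
    """Hypothetical narrower variant: only Lu, Ll, or Lt characters count."""
    return any(ud.category(c) in CASED_CATEGORIES for c in u)


for sample in ("abc", "!@#", "\u4e2d\u6587"):
    print(repr(sample), contains_letter(sample), contains_cased_narrow(sample))
```

The two functions agree on "abc" and "!@#" but differ on the CJK sample, which is made of Lo characters.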