I've recently been reading about casefold and case-insensitive string comparison. I've read that the MSDN guidance is to use InvariantCulture and definitely avoid toLowercase. However, from what I have read, casefold is like a more aggressive toLowercase. My question is: should I use casefold in Python, or is there a more Pythonic standard to use instead? Also, does casefold pass the Turkey Test?
Python's string casefold() method converts a string to lower case. It is similar to the lower() string method, but casefold() removes all the case distinctions present in a string.
The casefold() method is similar to the lower() method but it is more aggressive. This means the casefold() method converts more characters into lower case than lower() does. For example, the German letter ß is already lowercase, so the lower() method leaves it unchanged, whereas casefold() converts it to ss.
The casefold() method returns a string where all the characters are in lower case. It is similar to the lower() method, but the casefold() method converts more characters into lower case. For example, the German lowercase letter 'ß' is equivalent to 'ss' .
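A short sketch of the difference described above, using only the built-in str methods. The variable names are just for illustration.

```python
# Compare lower() and casefold() on a word containing the German sharp s.
word = "Straße"

lowered = word.lower()      # 'ß' is already lowercase, so lower() keeps it
folded = word.casefold()    # casefold() folds 'ß' to 'ss'

print(lowered)  # straße
print(folded)   # strasse

# Only casefold() makes the two spellings compare equal.
equal_with_lower = "STRASSE".lower() == word.lower()
equal_with_casefold = "STRASSE".casefold() == word.casefold()
print(equal_with_lower)     # False
print(equal_with_casefold)  # True
```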
Case-Sensitive Names in Python The same rule applies to function names. To avoid problems with case-sensitive functions and variable names, use lowercase names with underscores between words for readability (e.g., user_name ) as stated in the official Python documentation.
1) In Python 3, casefold() should be used to implement caseless string matching.
Starting with Python 3.0, strings are stored as Unicode. Section 3.13 of the Unicode Standard (Default Case Algorithms) defines the default caseless matching as follows:
A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)
Python's casefold() implements Unicode's toCasefold(), so it should be used to implement caseless string matching. However, case folding alone is not enough to cover some corner cases or to pass the Turkey Test (see Point 3).
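As a rough illustration of both halves of that statement, here is a minimal sketch. The helper simple_caseless_match() is a hypothetical name, not part of the standard library, and the corner case shown (composed versus decomposed accents) is just one example of where folding alone falls short.

```python
import unicodedata


def simple_caseless_match(x, y):
    """Hypothetical helper: default caseless match, toCasefold(X) == toCasefold(Y)."""
    return x.casefold() == y.casefold()


composed = "\u00c9COLE"      # 'É' stored as one code point (U+00C9)
decomposed = "e\u0301cole"   # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

# Folding alone handles ordinary case differences, including ß vs. SS ...
print(simple_caseless_match("STRASSE", "straße"))   # True
# ... but not canonically equivalent strings in different normalization forms.
print(simple_caseless_match(composed, decomposed))  # False

# Normalizing both sides to NFD before folding removes the difference.
nfd_composed = unicodedata.normalize("NFD", composed).casefold()
nfd_decomposed = unicodedata.normalize("NFD", decomposed).casefold()
print(nfd_composed == nfd_decomposed)  # True
```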
2) As of Python 3.6, casefold() cannot pass the Turkey Test.
For two characters, uppercase I and dotted uppercase I, the Unicode Standard defines two different casefolding mappings.
The default (for non-Turkic languages):
I → i (U+0049 → U+0069)
İ → i̇ (U+0130 → U+0069 U+0307)
The alternative (for Turkic languages):
I → ı (U+0049 → U+0131)
İ → i (U+0130 → U+0069)
Python's casefold() can apply only the default mapping and fails the Turkey Test. For example, the Turkish words "LİMANI" and "limanı" are caseless equivalents, but "LİMANI".casefold() == "limanı".casefold() returns False. There is no option to enable the alternative mapping.
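The default mapping can be checked directly in an interpreter. This is only a quick sketch; the code points are written as escapes so that the dotted and dotless letters are unambiguous, and the variable names are illustrative.

```python
dotless_upper = "I"        # U+0049 LATIN CAPITAL LETTER I
dotted_upper = "\u0130"    # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE (İ)
dotless_lower = "\u0131"   # U+0131 LATIN SMALL LETTER DOTLESS I (ı)

# Default (non-Turkic) folding, as applied by str.casefold():
print(dotless_upper.casefold() == "i")        # True: I -> i
print(dotted_upper.casefold() == "i\u0307")   # True: İ -> i + COMBINING DOT ABOVE
print(dotless_lower.casefold() == "\u0131")   # True: ı is left as it is

# Hence the Turkish pair from the text does not match.
turkish_upper = "L\u0130MANI"   # LİMANI
turkish_lower = "liman\u0131"   # limanı
print(turkish_upper.casefold() == turkish_lower.casefold())  # False
```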
3) How to do caseless string matching in Python 3.
Section 3.13 of the Unicode Standard describes several caseless matching algorithms. Canonical caseless matching would probably suit most use cases. This algorithm already takes all the corner cases into account. We only need to add an option to switch between non-Turkic and Turkic case folding.
import unicodedata

def normalize_NFD(string):
    # Canonical decomposition (NFD).
    return unicodedata.normalize('NFD', string)

def casefold_(string, include_special_i=False):
    if include_special_i:
        # Recompose first so that a dotted capital I is the single code point U+0130,
        # then apply the Turkic mappings I -> ı and İ -> i before folding.
        string = unicodedata.normalize('NFC', string)
        string = string.replace('\u0049', '\u0131')
        string = string.replace('\u0130', '\u0069')
    return string.casefold()

def casefold_NFD(string, include_special_i=False):
    # NFD(toCasefold(NFD(string)))
    return normalize_NFD(casefold_(normalize_NFD(string), include_special_i))

def caseless_match(string1, string2, include_special_i=False):
    return casefold_NFD(string1, include_special_i) == casefold_NFD(string2, include_special_i)
casefold_() is a wrapper for Python's casefold(). If its parameter include_special_i is set to True, it applies the Turkic mapping; if it is set to False, the default mapping is used.
caseless_match() does the canonical caseless matching for string1 and string2. If the strings are Turkic words, the include_special_i parameter must be set to True.
Examples:
>>> caseless_match('LİMANI', 'limanı', include_special_i=True)
True
>>> caseless_match('LİMANI', 'limanı')
False
>>> caseless_match('INTENSIVE', 'intensive', include_special_i=True)
False
>>> caseless_match('INTENSIVE', 'intensive')
True