Should I use Python casefold?

I have recently been reading about casefold and case-insensitive string comparison. I've read that the MSDN recommendation is to use InvariantCulture and to avoid toLowercase. However, from what I have read, casefold is like a more aggressive toLowercase. My question is: should I use casefold in Python, or is there a more Pythonic standard to use instead? Also, does casefold pass the Turkey Test?

asked Oct 31 '16 18:10

FlyingLightning

1 Answer

1) In Python 3, casefold() should be used to implement caseless string matching.

Starting with Python 3.0, strings are stored as Unicode. The Unicode Standard Chapter 3.13 defines the default caseless matching as follows:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

Python's casefold() implements Unicode's toCasefold(), so it should be used to implement caseless string matching. However, case folding alone is not enough to cover some corner cases or to pass the Turkey Test (see point 3).
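As a quick illustration of how casefold() differs from lower(), here is a minimal sketch. The German words are my own example and the variable names are only illustrative.

```python
# Case folding maps 'ß' to 'ss'; lower() leaves 'ß' unchanged.
word_a = "Straße"
word_b = "STRASSE"

lower_equal = word_a.lower() == word_b.lower()
folded_equal = word_a.casefold() == word_b.casefold()

print(lower_equal)   # False: 'straße' != 'strasse'
print(folded_equal)  # True: both fold to 'strasse'
```

This is why comparing lowercased strings is not a reliable caseless match, while comparing case-folded strings follows the Unicode definition quoted above.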

2) As of Python 3.6, casefold() cannot pass the Turkey Test.

For two characters, uppercase I and dotted uppercase I, the Unicode Standard defines two different casefolding mappings.

The default (for non-Turkic languages):
I → i (U+0049 → U+0069)
İ → i̇ (U+0130 → U+0069 U+0307)

The alternative (for Turkic languages):
I → ı (U+0049 → U+0131)
İ → i (U+0130 → U+0069)

Python's casefold() can apply only the default mapping, so it fails the Turkey Test. For example, the Turkish words "LİMANI" and "limanı" are caseless equivalents, but "LİMANI".casefold() == "limanı".casefold() returns False. There is no option to enable the alternative mapping.
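The default mapping can be inspected directly. This is a small sketch that uses escape sequences so the exact code points are unambiguous; the variable names are only illustrative.

```python
# Default (non-Turkic) case folding, as applied by str.casefold().
folded_I = "I".casefold()              # U+0049
folded_dotted_I = "\u0130".casefold()  # U+0130, dotted uppercase I

# List the code points produced for the dotted uppercase I.
code_points = [f"U+{ord(ch):04X}" for ch in folded_dotted_I]

print(folded_I == "i")  # True
print(code_points)      # ['U+0069', 'U+0307']

# "LİMANI" and "limanı" written with explicit code points.
turkish_match = "L\u0130MANI".casefold() == "liman\u0131".casefold()
print(turkish_match)    # False
```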

3) How to do caseless string matching in Python 3.

The Unicode Standard Chapter 3.13 describes several caseless matching algorithms. Canonical caseless matching, which compares NFD(toCasefold(NFD(X))) with NFD(toCasefold(NFD(Y))), would probably suit most use cases. This algorithm already takes the corner cases into account. We only need to add an option to switch between non-Turkic and Turkic case folding.

import unicodedata

def normalize_NFD(string):
    # Canonical decomposition, so canonically equivalent sequences compare equal.
    return unicodedata.normalize('NFD', string)

def casefold_(string, include_special_i=False):
    if include_special_i:
        # Turkic mapping: recompose first so that İ is the single code point U+0130.
        string = unicodedata.normalize('NFC', string)
        string = string.replace('\u0049', '\u0131')  # I → ı
        string = string.replace('\u0130', '\u0069')  # İ → i
    return string.casefold()

def casefold_NFD(string, include_special_i=False):
    # NFD(toCasefold(NFD(string)))
    return normalize_NFD(casefold_(normalize_NFD(string), include_special_i))

def caseless_match(string1, string2, include_special_i=False):
    return casefold_NFD(string1, include_special_i) == casefold_NFD(string2, include_special_i)

casefold_() is a wrapper for Python's casefold(). If its parameter include_special_i is set to True, it applies the Turkic mapping; if it is set to False, the default mapping is used.

caseless_match() does canonical caseless matching for string1 and string2. If the strings are in a Turkic language, the include_special_i parameter must be set to True.

Examples:

>>> caseless_match('LİMANI', 'limanı', include_special_i=True)
True

>>> caseless_match('LİMANI', 'limanı')
False

>>> caseless_match('INTENSIVE', 'intensive', include_special_i=True)
False

>>> caseless_match('INTENSIVE', 'intensive')
True
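To show why the NFD steps matter even without the Turkic option, here is a minimal sketch that compares a composed and a decomposed spelling of the same word. The helper nfd_casefold is a hypothetical name that mirrors casefold_NFD above without the include_special_i parameter, and the word is my own example.

```python
import unicodedata

composed = "CAF\u00c9"     # É as a single code point (U+00C9)
decomposed = "cafe\u0301"  # e followed by U+0301 COMBINING ACUTE ACCENT

# casefold() alone does not normalize, so the two spellings differ.
plain_equal = composed.casefold() == decomposed.casefold()

def nfd_casefold(text):
    # Hypothetical helper: NFD(toCasefold(NFD(text))), default mapping only.
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())

canonical_equal = nfd_casefold(composed) == nfd_casefold(decomposed)

print(plain_equal)      # False
print(canonical_equal)  # True
```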

answered Sep 21 '22 12:09

SergiyKolesnikov


Tags: python, python-3.x, case-folding
