Why is upper casing not enough for case-insensitive comparison?

Tags:

case-insensitive

unicode

case-folding

To compare two strings case insensitively, one correct way is to case fold them first. How is this better than upper casing or lower casing?

I have found examples online where lower casing doesn't work. For example, "σ" and "ς" (two forms of "Σ") don't become the same when converted to lower case. But I've failed to find why case folding is better than mapping to upper case. Is there a case where two strings that should match case insensitively don't upper case to the same string?

Another scenario is when I want to store a case insensitive index. The recommended way seems to be case folding and then normalizing. What are its advantages over storing the string mapped to upper case and normalized? The specs say mapping to upper case is not guaranteed to be stable across versions of Unicode while case folding is. But are there any cases where mapping to upper case gives a different string in an earlier version of Unicode?

asked Apr 15 '21 10:04

93Iq2Gg2cZtLMO

1 Answer

As per Unicode stability policy, case mappings are only stable for case pairs, i.e. pairs of characters X and Y where X is the full uppercase mapping of Y, and Y is the full lowercase mapping of X. Only when both these characters exist with these properties is the casing relation between them set in stone.

However, Unicode contains many “incomplete” case pairs where only the lowercase form has been encoded and the uppercase form is missing completely. This is usually the case for letters used in transcription systems that are traditionally lowercase-only. Should capital forms be discovered and subsequently added to Unicode, these letters would then receive a new uppercase mapping.

The most recent characters this has happened to are “ʂ” (from Unicode 1.1), “ᶎ” (from Unicode 4.1), and “ꞔ” (from Unicode 7.0), which all got brand new uppercase forms (Ʂ, Ᶎ, and Ꞔ respectively) in Unicode 12.0, released in 2019.
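To see this for yourself, here is a rough sketch in Python. It assumes that `str.upper()` and `str.casefold()` follow the Unicode version bundled with the interpreter, which `unicodedata.unidata_version` reports. The names `NEW_PAIRS`, `unicode_version` and `expected_upper` are made up for this illustration.

```python
import unicodedata

# Lowercase letters that gained uppercase forms in Unicode 12.0.
NEW_PAIRS = {
    "\u0282": "\ua7c5",  # ʂ -> Ʂ
    "\u1d8e": "\ua7c6",  # ᶎ -> Ᶎ
    "\ua794": "\ua7c4",  # ꞔ -> Ꞔ
}

def unicode_version():
    """Unicode version of this interpreter's character database, as a tuple."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))

def expected_upper(lower):
    """Uppercase mapping of `lower` under this interpreter's Unicode version."""
    return NEW_PAIRS[lower] if unicode_version() >= (12, 0) else lower

for lower in NEW_PAIRS:
    # The uppercase result depends on which Unicode version is in use...
    print(f"U+{ord(lower):04X} upper -> U+{ord(lower.upper()):04X}")
    # ...while case folding leaves the lowercase letter unchanged either way.
    print(f"U+{ord(lower):04X} fold  -> U+{ord(lower.casefold()):04X}")
```

An index keyed on uppercased strings and built before the upgrade to Unicode 12.0 would hold different keys for these letters than one built afterwards. A case-folded index would not.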

Because case mappings do not have to be unique, this makes uppercasing a poor substitute for proper case-folding. For example, both U+0434 (д) and U+1C81 (ᲁ) uppercase to U+0414 (Д), but only the former is locked into a case pair by virtue of being U+0414’s full lowercase mapping. If someone were to find a dedicated capital letter version of U+1C81 in some old manuscript, it would be given a new uppercase mapping, resulting in U+0434 and U+1C81 suddenly no longer comparing equal under that operation.
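A small Python check of that Cyrillic example, as a sketch only. It assumes an interpreter whose Unicode data already includes U+1C81 (Unicode 9.0 or later, so Python 3.6+). The variable names are mine.

```python
small_de = "\u0434"        # д CYRILLIC SMALL LETTER DE
long_legged_de = "\u1c81"  # ᲁ CYRILLIC SMALL LETTER LONG-LEGGED DE

# Both letters currently share the uppercase mapping U+0414, but only the
# pairing of U+0434 with U+0414 is covered by the stability policy.
same_upper = small_de.upper() == long_legged_de.upper()

# Case folding maps both letters to U+0434, and that result is stable.
same_fold = small_de.casefold() == long_legged_de.casefold()

print("equal after upper():   ", same_upper)
print("equal after casefold():", same_fold)
```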

EDIT: I have just remembered a current example of uppercasing not being sufficient for case-insensitive matching: U+1E9E (ẞ) is already a capital letter and thus uppercases to itself. Its lowercase counterpart is U+00DF (ß), but the uppercase mapping of U+00DF is the sequence <U+0053, U+0053> (SS).

uppercase("ẞ") ≠ uppercase(lowercase("ẞ"))
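In Python this can be reproduced as below. The helper `caseless_key` is a hypothetical name for this sketch. It follows the Unicode Standard's definition of canonical caseless matching (NFD, then case fold, then NFD again), which is one way to build the "case fold and normalize" index key from the question.

```python
import unicodedata

capital_sharp_s = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00df"    # ß LATIN SMALL LETTER SHARP S

def caseless_key(text):
    """Key for canonical caseless matching: NFD(casefold(NFD(text)))."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())

# Uppercasing keeps ẞ as it is but turns ß into "SS", so the two differ.
upper_match = capital_sharp_s.upper() == small_sharp_s.upper()

# Full case folding maps both to "ss", so the keys are equal.
fold_match = caseless_key(capital_sharp_s) == caseless_key(small_sharp_s)

print("equal after upper():  ", upper_match)
print("equal by caseless_key:", fold_match)
```

The same key also makes "σ", "ς" and "Σ" compare equal, which covers the lowercase problem from the question.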

answered Oct 18 '22 20:10

CharlotteBuff
