Case-insensitive storage and Unicode compatibility

Tags:

unicode

compatibility

After I heard of someone at my work using String.toLowerCase() to store case-insensitive codes in a database for searchability, I had an epic fail moment thinking about the number of ways that it can go wrong:

Turkey test (in particular, changing the locale on the running computer)
Unicode version upgrades. I mean, who knows about this stuff? If I upgrade to Java 7, do I have to reindex my data if I'm being case-insensitive?

What technologies are affected by Unicode versions?

Do I need to worry about Oracle or SQL Server (or other vendors) changing their Unicode versions, so that one of my locales no longer produces the same lowercase or uppercase conversion?

How do I manage this? I'm tempted by the "simplicity" of ensuring I use the database conversion, but when there's an upgrade it'll be the same sort of issue.

asked Aug 09 '11 03:08

Stephen

2 Answers

You do not want to store the lowercase version of a string "for searchability"!!

That is the wrong approach altogether. You are making unjustified and incorrect assumptions about how Unicode casing works.

This is why Unicode defines a separate thing called a casefold for a string, distinct from the three different cases (lowercase, titlecase, and uppercase).

Here are ten different examples where you will do the wrong thing if you use the lowercase instead of the casefold:

ORIGINAL        CASEFOLD        LOWERCASE       TITLECASE       UPPERCASE
=========================================================================
eﬃcient         efficient       eﬃcient         Eﬃcient         EFFICIENT
ﬂour            flour           ﬂour            Flour           FLOUR
poſt            post            poſt            Poſt            POST
poﬅ             post            poﬅ             Poﬅ             POST
ﬅop             stop            ﬅop             Stop            STOP
tschüß          tschüss         tschüß          Tschüß          TSCHÜSS
weiß            weiss           weiß            Weiß            WEISS
WEIẞ            weiss           weiß            Weiß            WEIẞ
στιγμας         στιγμασ         στιγμας         Στιγμας         ΣΤΙΓΜΑΣ
ᾲ στο διάολο    ὰι στο διάολο   ᾲ στο διάολο    Ὰͅ Στο Διάολο    ᾺΙ ΣΤΟ ΔΙΆΟΛΟ

And yes, I know the plural of stigma is stigmata not stigmas; I am trying to show the final sigma issue. Both ς and σ are valid lowercase versions of the uppercase sigma, Σ. If you store “just the lowercase”, then you will get the wrong thing.
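To make the failure concrete, here is a minimal Java sketch that compares a plain lowercase key against a rough stand-in for a casefold. The class and the helpers `lowerKey` and `approxFoldKey` are hypothetical names made up for this illustration. Uppercasing and then lowercasing is only an approximation chosen because the JDK has no full-casefold method on `String`; it is not Unicode's casefold, and it still misses cases such as the capital sharp s in WEIẞ. A library such as ICU4J (`UCharacter.foldCase`) would be the place to look for a real casefold.

```java
import java.util.Locale;

class CaseKeyDemo {
    // Plain lowercase key: the approach the question describes.
    static String lowerKey(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: approximates a casefold by uppercasing first and
    // lowercasing second. This is an assumption-laden stand-in, NOT the
    // Unicode casefold; it only shows why a lowercase-only key falls short.
    static String approxFoldKey(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String weissSharpS = "wei\u00DF"; // "weiss" spelled with sharp s
        String weissUpper = "WEISS";
        // Greek "stigmas" ending in final sigma and in ordinary sigma
        String sigmaFinal = "\u03C3\u03C4\u03B9\u03B3\u03BC\u03B1\u03C2";
        String sigmaPlain = "\u03C3\u03C4\u03B9\u03B3\u03BC\u03B1\u03C3";

        // Lowercase keys disagree for strings that differ only by case.
        System.out.println(lowerKey(weissSharpS).equals(lowerKey(weissUpper)));
        System.out.println(lowerKey(sigmaFinal).equals(lowerKey(sigmaPlain)));

        // The approximate fold keys agree for the same pairs.
        System.out.println(approxFoldKey(weissSharpS).equals(approxFoldKey(weissUpper)));
        System.out.println(approxFoldKey(sigmaFinal).equals(approxFoldKey(sigmaPlain)));
    }
}
```

The first two comparisons come out false and the last two come out true, which is the gap between "store the lowercase" and "store something closer to a casefold".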

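If you are using Java’s Pattern class, you must specify both CASE_INSENSITIVE and UNICODE_CASE, and you still will not get these right, because while Java uses full casemapping, it uses only simple casefolding. A simple casefold maps one code point to exactly one code point, so ß can never match ss and ﬃ can never match ffi. This is a problem.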
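The Pattern behaviour described here can be checked with a short sketch. The class and helper names (`RegexFoldDemo`, `matchesIgnoringCase`, `matchesAsciiOnly`) are invented for the example; the flags and `Pattern.quote` are the standard `java.util.regex` API.

```java
import java.util.regex.Pattern;

class RegexFoldDemo {
    // Case-insensitive match with Unicode-aware case handling.
    static boolean matchesIgnoringCase(String literal, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(Pattern.quote(literal), flags).matcher(input).matches();
    }

    // Case-insensitive match without UNICODE_CASE: only ASCII letters fold.
    static boolean matchesAsciiOnly(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal), Pattern.CASE_INSENSITIVE)
                .matcher(input).matches();
    }

    public static void main(String[] args) {
        String eteLower = "\u00E9t\u00E9"; // French "ete" with acute accents
        String eteUpper = "\u00C9T\u00C9"; // the same word in capitals

        // Without UNICODE_CASE the accented letters do not match across case.
        System.out.println(matchesAsciiOnly(eteLower, eteUpper));
        // With both flags, one-to-one case pairs match.
        System.out.println(matchesIgnoringCase(eteLower, eteUpper));
        System.out.println(matchesIgnoringCase("weiss", "WEISS"));
        // Sharp s against "ss" needs a one-to-two fold, so this does not match.
        System.out.println(matchesIgnoringCase("weiss", "wei\u00DF"));
    }
}
```

Only the two middle comparisons succeed. The last one fails even with both flags set, which is the simple-versus-full casefolding gap.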

As for the Turkic languages, yes, it is true that there is a special casefold for Turkic. For example, İstanbul has a Turkic casefold of just ı̇stanbul instead of the i̇stanbul that you are supposed to get. Since I am sure those will not look right to you, I’ll spell it out with named characters for the non-ASCII; in plainer terms, "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}stanbul" has a Turkic casefold of "\N{LATIN SMALL LETTER DOTLESS I}\N{COMBINING DOT ABOVE}stanbul" rather than "i\N{COMBINING DOT ABOVE}stanbul" that you normally get.

Here are a few more table rows if you’re writing a regression test suite:

[ "Henry Ⅷ", "henry ⅷ", "henry ⅷ", "Henry Ⅷ", "HENRY Ⅷ",  ],
[ "I Work At Ⓚ",  "i work at ⓚ",  "i work at ⓚ", "I Work At Ⓚ", "I WORK AT Ⓚ", ],
[ "ʀᴀʀᴇ", "ʀᴀʀᴇ", "ʀᴀʀᴇ", "Ʀᴀʀᴇ", "ƦᴀƦᴇ",  ],
[ "Ԧԧ", "ԧԧ", "ԧԧ", "Ԧԧ", "ԦԦ",   ],
[ "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐯𐑅𐐨𐑉𐐯𐐻", "𐐔𐐇𐐝𐐀𐐡𐐇𐐓",   ],
[ "Ὰͅ", "ὰι", "ᾲ", "Ὰͅ", "ᾺΙ",  ],

The columns are orig, fold, lc, tc, and uc, in the same order as the earlier table above. Notice again how the last row has a casefold that is different from its lowercase.

answered Sep 20 '22 11:09

tchrist

Specify a locale for toLowerCase() instead of using the system default. This protects against changes to the system locale.
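As a rough illustration of that advice, the sketch below lowercases the same strings with a fixed locale and with a Turkish locale. The class and helper names are made up for the example, and `Locale.forLanguageTag` assumes Java 7 or later (on Java 6, `new Locale("tr")` would be the equivalent).

```java
import java.util.Locale;

class LocaleLowerDemo {
    // Locale-independent lowercase: the result does not depend on the
    // machine's default locale.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Lowercase under an explicitly named locale, for comparison.
    static String lowerIn(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        String word = "TITLE";
        String city = "\u0130stanbul"; // begins with capital I with dot above

        System.out.println(lowerRoot(word));     // plain ASCII "title"
        System.out.println(lowerIn(word, "tr")); // dotless i in second position
        System.out.println(lowerRoot(city).length());     // 9: i plus combining dot above
        System.out.println(lowerIn(city, "tr").length()); // 8: plain i
    }
}
```

Passing the locale explicitly pins the result regardless of the machine's settings. It does not address the separate point that a lowercase is not a casefold.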

As for possible Unicode changes in future versions of Java, I don't think it's worth writing code to handle this. Document that the product supports Java 6 and move on to a feature that your customers actually want.

answered Sep 19 '22 11:09

Matthew Gatland
