How do you set strings to uppercase / lowercase in Unicode?

Tags: string, uppercase, unicode, theory, low-level

This is mostly a theoretical question I'm just very curious about. (I'm not trying to do this by coding it myself or anything, I'm not reinventing wheels.)

My question is how the uppercase/lowercase table of equivalence works for Unicode.

For example, if I had to do this in ASCII, I'd take a character, and if it falls within the [a-z] range, I'd add the difference between A and a.

If it doesn't fall in that range, I'd have a small equivalence table for the 10 or so accented characters plus ñ. (Or I could just have a full equivalence array with 256 entries, most of which would be the same as the input.)

However, I'm guessing that there's a better way of specifying the equivalences in Unicode, given that there are hundreds of thousands of characters, and that theoretically, a new language or set of characters can be added (and I'm expecting that you wouldn't need to patch Windows when that happens).

Does Windows have a huge hard-coded equivalence table for each character? Or how is this implemented?

A related question is how SQL Server implements Unicode-based accent-insensitive and case-insensitive queries. Does it have an internal table that tells it that é ë è E É È and Ë are all equivalent to "e"?

That doesn't sound very fast when it comes to comparing strings.

How does it access indexes quickly? Does it already index values converted to their "base" characters, corresponding to that field's collation?

Does anyone know the internals for these things?

Thank you!

678

asked Nov 18 '08 02:11

Daniel Magliola

1 Answer

I'm going to address the MS SQL Server part of this question, but the "correct" answer actually depends on the language(s) supported and application.

When you create a table in SQL Server, each text field has either an implicitly or explicitly specified collation. This affects both sort order and comparison behavior. The default, for most English (US) locales, is Latin1_General_CI_AS, or Latin 1, Case-insensitive, Accent-Sensitive. That means that, for example, a=A, but a!=Ä and a!=ä. You can also use accent-insensitive (Latin1_General_CI_AI) which treats all the diacritic variations of "A" as equal.
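To make the CI_AS versus CI_AI distinction concrete, here is a minimal Python sketch that imitates the two comparison behaviors. It is not how SQL Server implements collations; the helper names (`strip_accents`, `equal_ci_as`, `equal_ci_ai`) are made up for illustration, and it only relies on the standard `unicodedata` module.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters, then drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_ci_as(a: str, b: str) -> bool:
    """Case-insensitive, accent-sensitive (roughly the CI_AS idea)."""
    fold = lambda s: unicodedata.normalize("NFC", s).casefold()
    return fold(a) == fold(b)


def equal_ci_ai(a: str, b: str) -> bool:
    """Case-insensitive, accent-insensitive (roughly the CI_AI idea)."""
    fold = lambda s: strip_accents(s).casefold()
    return fold(a) == fold(b)


if __name__ == "__main__":
    print(equal_ci_as("a", "A"))       # case differs only
    print(equal_ci_as("a", "\u00c4"))  # a vs. Ä, accent-sensitive
    print(equal_ci_ai("a", "\u00c4"))  # a vs. Ä, accent-insensitive
```

Under these assumptions, a matches A in both modes, while a matches Ä and ä only in the accent-insensitive mode.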

Some locales support other categories of comparison; for example, French orders words containing diacritics somewhat differently than German does. Turkish considers a dotless i and dotted i semantically different, so I and i don't match even with case-insensitive comparisons if you use Turkish, case-insensitive, accent-sensitive collation.
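The Turkish case can be illustrated with a small Python sketch. Python's built-in `str.lower()` is locale-independent, so the Turkish rule has to be applied by hand; `lower_tr` below is a hypothetical helper that covers only the two special letters and is not a complete Turkish case mapping.

```python
DOTLESS_I = "\u0131"    # ı, LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE


def lower_tr(text: str) -> str:
    """Lowercase using the Turkish rule: I -> ı and İ -> i."""
    # Handle the two special letters first, then let the default
    # mapping take care of everything else.
    text = text.replace("I", DOTLESS_I).replace(DOTTED_CAP_I, "i")
    return text.lower()


if __name__ == "__main__":
    # Default (locale-independent) mapping treats I and i as a case pair.
    print("I".lower() == "i")
    # Under the Turkish rule they lowercase to different letters.
    print(lower_tr("I") == lower_tr("i"))
```

With the default mapping, I and i compare equal after lowercasing; with the Turkish rule they do not, which is why the collation has to be part of the comparison.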

You can change the collation per database, per table, per field, and, with some cost, even per-query. My understanding is that indices normalize according to the specified collation order, which means that basically the index keeps a flattened version of the original string. For example, with case-insensitive collations, Apple and apple are stored as apple. Queries are flattened with the same collation before the search.
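As a toy model of the "flattened index" idea described above, the following Python sketch stores rows under a normalized key and flattens the search term the same way before looking it up. The class and function names are hypothetical, and a real database index is a sorted on-disk structure rather than a dictionary; this only shows why the lookup stays fast.

```python
import unicodedata
from collections import defaultdict


def flatten(value: str) -> str:
    """Normalize and case-fold a value, imitating a case-insensitive collation."""
    return unicodedata.normalize("NFC", value).casefold()


class CaseInsensitiveIndex:
    """Maps flattened keys to the ids of the rows that contain them."""

    def __init__(self) -> None:
        self._rows = defaultdict(list)

    def add(self, value: str, row_id: int) -> None:
        self._rows[flatten(value)].append(row_id)

    def lookup(self, value: str) -> list:
        # The query is flattened with the same rule before the search.
        return list(self._rows.get(flatten(value), []))


if __name__ == "__main__":
    index = CaseInsensitiveIndex()
    index.add("Apple", 1)
    index.add("apple", 2)
    index.add("Banana", 3)
    print(index.lookup("APPLE"))
```

Because both the stored values and the query go through the same `flatten` step, no per-row comparison table is consulted at query time.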

In Japanese, there's another category of normalization, where fullwidth and halfwidth characters are treated as equal (ア=ｱ) and, in some cases, two halfwidth characters are flattened to a single, semantically equivalent character (バ=ﾊﾞ). Finally, for some languages, there's another ball of wax with composite characters, where isolated diacritic characters can be composed with other characters (e.g. ä can be stored either as a single character or as the simple form a followed by a combining umlaut). Vietnamese, Thai and a few other languages have variations of this category. If there's a canonical form, Unicode normalization allows the composed and decomposed forms to be treated as equivalent. Unicode normalization is typically applied before any comparisons are made.
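Both kinds of equivalence can be checked with Python's standard `unicodedata.normalize`. The sketch below uses NFC for the canonical case (composed versus decomposed ä) and NFKC for the compatibility case (fullwidth versus halfwidth katakana); the two helper names are made up for illustration.

```python
import unicodedata


def same_canonical(a: str, b: str) -> bool:
    """Equal after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def same_compatibility(a: str, b: str) -> bool:
    """Equal after compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


if __name__ == "__main__":
    composed = "\u00e4"          # ä as one code point
    decomposed = "a\u0308"       # a + COMBINING DIAERESIS
    fullwidth_a = "\u30a2"       # ア
    halfwidth_a = "\uff71"       # ｱ
    fullwidth_ba = "\u30d0"      # バ
    halfwidth_ba = "\uff8a\uff9e"  # ﾊ + halfwidth voiced sound mark

    print(composed == decomposed)                      # raw code points differ
    print(same_canonical(composed, decomposed))
    print(same_canonical(fullwidth_a, halfwidth_a))    # width is not canonical
    print(same_compatibility(fullwidth_a, halfwidth_a))
    print(same_compatibility(fullwidth_ba, halfwidth_ba))
```

Note that width differences are only folded by the compatibility forms (NFKC/NFKD), not by the canonical ones (NFC/NFD).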

To summarize, for a case-insensitive comparison, you do something much like you would when comparing ASCII-range strings: flatten the left and right sides of the comparison "to lower case" (for example), then compare the results as binary arrays. The difference is that you need to:

1) normalize the strings to the same Unicode normalization form (NFKC or NFKD)
2) normalize the strings to the same case according to the rules of that locale
3) normalize the accents according to the accent-sensitivity rules
4) compare according to a binary comparison
5) if applicable, such as in the case of sorting, compare using additional secondary and tertiary sorting rules, which cover things like "Mc" sorting before "M" in some languages.
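The steps above can be sketched as a multi-level key in Python. This is a deliberately simplified, locale-unaware illustration, not the Unicode Collation Algorithm or what Windows or SQL Server actually do: `collation_key` is a hypothetical helper whose first level ignores case and accents, whose second level adds accents back, and whose third level adds case back.

```python
import unicodedata


def collation_key(text: str) -> tuple:
    """Build a (primary, secondary, tertiary) key for comparison and sorting."""
    # Step 1: bring everything to one normalization form.
    decomposed = unicodedata.normalize("NFKD", text)
    # Steps 2 and 3: fold case and drop accents for the primary level.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    primary = base.casefold()
    # Secondary level keeps accents but still ignores case.
    secondary = decomposed.casefold()
    # Tertiary level keeps both accents and case.
    tertiary = decomposed
    return (primary, secondary, tertiary)


if __name__ == "__main__":
    words = ["r\u00e9sum\u00e9", "resume", "R\u00e9sum\u00e9", "Resume"]
    # Step 5: sorting compares the levels in order.
    print(sorted(words, key=collation_key))
    # Step 4: a case-insensitive, accent-sensitive equality test only
    # needs the first two levels.
    print(collation_key("Apple")[:2] == collation_key("apple")[:2])
```

Comparing only the primary level gives a case- and accent-insensitive match, while the later levels act as tie-breakers when sorting.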

And yes, Windows stores tables for all of these rules. You don't get all of them by default in every installation unless you add support for them with the East Asian Language Support and Complex Scripts support options in Control Panel.
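The table-driven nature of Unicode case mapping is easy to observe from Python, which ships its own copy of the Unicode data rather than using the Windows tables. In this sketch, `upper_offset` is a hypothetical helper that reports how far a character moves when uppercased; the offsets vary from letter to letter, and some mappings are not even one-to-one, so a fixed "add the difference" rule cannot work.

```python
import unicodedata


def upper_offset(ch: str):
    """Return the code point offset to the uppercase form, or None
    when the uppercase form is not a single code point."""
    upper = ch.upper()
    if len(upper) != 1:
        return None
    return ord(upper) - ord(ch)


if __name__ == "__main__":
    # Version of the Unicode data tables bundled with this interpreter.
    print(unicodedata.unidata_version)
    for ch in ["a", "\u00e9", "\u00ff", "\u00df"]:  # a, é, ÿ, ß
        print(repr(ch), repr(ch.upper()), upper_offset(ch))
```

Here a and é both move by -32, ÿ (U+00FF) moves by +121 to Ÿ (U+0178), and ß uppercases to the two-character string "SS". Support for new characters arrives when the bundled tables are updated to a newer Unicode version.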

120

answered Oct 08 '22 04:10

JasonTrue
