Unicode characters having asymmetric upper/lower case. Why?

Tags: uppercase, lowercase, unicode, case-conversion, symmetry

Why do the following three characters not have symmetric toLower/toUpper results?

/**
  * Written in the Scala programming language, typed into the Scala REPL.
  * Results commented accordingly.
  */
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // true: "1e9e" == "1e9e"
'\u1e9e'.toLower.toHexString == "df" // true: "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // false: "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // true: "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // true: "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // false: "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // true: "130" == "130"
'\u0130'.toLower.toHexString == "69" // true: "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // false: "130" != "49"


asked Sep 20 '11 20:09

Tim Friske

2 Answers

For the first one, there is this explanation:

In the German language, the Sharp S ("ß" or U+00df) is a lowercase letter, and it capitalizes to the letters "SS".

In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.
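A minimal sketch of that asymmetry on the JVM, assuming the same Char methods used in the question. The object name SharpSCase and the helper hex are made up for illustration. It contrasts the per-character mapping, which must return exactly one Char, with the String-level mapping, which is allowed to change length:

```scala
import java.util.Locale

// Hypothetical demo object; the names are illustrative only.
object SharpSCase {
  val capitalSharpS: Char = '\u1e9e'

  // Char-level mappings: one UTF-16 code unit in, one code unit out.
  val smallSharpS: Char = capitalSharpS.toLower // U+00DF
  val charUpper: Char = smallSharpS.toUpper     // still U+00DF, not U+1E9E

  // String-level mapping: may expand to more than one character.
  val stringUpper: String = smallSharpS.toString.toUpperCase(Locale.ROOT)

  // Format a Char as a code point label such as U+00DF.
  def hex(c: Char): String = f"U+${c.toInt}%04X"

  def main(args: Array[String]): Unit = {
    println(s"${hex(capitalSharpS)} toLower -> ${hex(smallSharpS)}")
    println(s"${hex(smallSharpS)} toUpper -> ${hex(charUpper)}")
    println(s"String.toUpperCase -> $stringUpper")
  }
}
```

Neither path returns U+1E9E, so the round trip cannot be symmetric for this character.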

For the second one, U+212A (KELVIN SIGN) lower-cases to U+006B (LATIN SMALL LETTER K). The upper-case of U+006B is U+004B (LATIN CAPITAL LETTER K). This one seems to make sense to me.

For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that in a Turkish/Azerbaijani locale upper-casing U+0069 gives back U+0130, but that mapping is locale-specific rather than universal.
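The locale dependence can be checked with the locale-aware String methods from java.lang.String. This is a sketch only: the object name DottedI is hypothetical, and it assumes the JVM's standard Turkish case mappings, obtained here through Locale.forLanguageTag("tr"):

```scala
import java.util.Locale

// Hypothetical demo object; the names are illustrative only.
object DottedI {
  val turkish: Locale = Locale.forLanguageTag("tr")

  // Upper-casing "i" depends on the locale passed in.
  val upperInTurkish: String = "i".toUpperCase(turkish)  // U+0130
  val upperInRoot: String = "i".toUpperCase(Locale.ROOT) // U+0049

  // In the Turkish locale the round trip is symmetric again.
  val lowerInTurkish: String = "\u0130".toLowerCase(turkish) // U+0069

  def main(args: Array[String]): Unit = {
    println(upperInTurkish.map(c => f"U+${c.toInt}%04X").mkString(" "))
    println(upperInRoot.map(c => f"U+${c.toInt}%04X").mkString(" "))
    println(lowerInTurkish.map(c => f"U+${c.toInt}%04X").mkString(" "))
  }
}
```

Char.toLower and Char.toUpper take no locale, which is why the question's REPL session always sees the locale-independent mapping.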

Characters need not necessarily have symmetric upper- and lower-case transformations.

Edit: To respond to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, “Unicode Normalization Forms,” these three characters will be replaced by their regular equivalents.

In other words, you shouldn't really be using U+212A, you should be using U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.
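As a sketch of that normalization step, java.text.Normalizer from the Java standard library can be used. The object name KelvinNormalize is hypothetical, and NFC is picked here as one of the forms defined in Annex #15:

```scala
import java.text.Normalizer
import java.util.Locale

// Hypothetical demo object; the names are illustrative only.
object KelvinNormalize {
  val kelvin: String = "\u212a" // KELVIN SIGN

  // Canonical normalization replaces the sign with the regular letter.
  val normalized: String = Normalizer.normalize(kelvin, Normalizer.Form.NFC)

  // After normalization, the lower/upper round trip is symmetric.
  val roundTrip: String =
    normalized.toLowerCase(Locale.ROOT).toUpperCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(kelvin == "K")           // the raw sign is not the letter K
    println(normalized == "K")       // the normalized text is
    println(roundTrip == normalized) // and it survives the round trip
  }
}
```

Normalizing input before comparing or case-converting it sidesteps the asymmetry for this character.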

answered Oct 05 '22 02:10

CanSpice

May I refer you to another post about Unicode and upper and lower case. It is a common mistake to think that every sign of a language has to be available in both upper and lower case!

Unicode-correct title case in Java

answered Oct 05 '22 01:10

definitely undefinable

