
How does String.ToLowerInvariant() determine to what string/character it must convert?

Tags:

c#

unicode

As we know, Unicode was invented to solve the code page problem and to represent the characters of all (well, not all, but most) languages of the world. Next we have the Unicode transformation formats, which define how a Unicode character is represented in bytes:

  • utf-8: one character takes from 1 to 4 bytes
  • utf-16: one character takes 2 bytes, or 2 * 2 bytes = 4 bytes (.NET uses this)
  • utf-32: one character always takes 4 bytes (I heard Python uses this)
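The byte counts in the list above can be checked with System.Text.Encoding. This is a minimal sketch; the class and method names (EncodingSizes, ByteCounts) are made up for illustration and are not part of the original question:

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Returns the encoded size in bytes of the given text in UTF-8, UTF-16 and UTF-32.
    // GetByteCount does not include a byte order mark.
    public static (int Utf8, int Utf16, int Utf32) ByteCounts(string text) =>
        (Encoding.UTF8.GetByteCount(text),
         Encoding.Unicode.GetByteCount(text),   // Encoding.Unicode is UTF-16 little-endian
         Encoding.UTF32.GetByteCount(text));

    public static void Main()
    {
        // "a" is ASCII, "č" and "€" are in the Basic Multilingual Plane,
        // "😀" (U+1F600) lies outside it and needs a surrogate pair in UTF-16.
        foreach (var s in new[] { "a", "č", "€", "😀" })
        {
            var (utf8, utf16, utf32) = ByteCounts(s);
            Console.WriteLine($"{s}: utf-8={utf8}, utf-16={utf16}, utf-32={utf32}");
        }
    }
}
```

For "č" this gives 2, 2 and 4 bytes, and for "😀" it gives 4 bytes in every encoding, which matches the ranges listed above.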

So far, so good. Next, take two languages as an example:

English in the United Kingdom (en-GB) and Slovenian in Slovenia (sl-SI). English has the characters a, b, c, d, e, ... x, y, z. Slovene has the same characters except q, w, x and y, and it has the additional characters č, š, ž. If I run the code below:

using System.Globalization;
using System.Threading;

Thread.CurrentThread.CurrentCulture = new CultureInfo("sl-SI");
string upperCase = "č".ToUpper(); // returns Č, which is correct for the sl-SI culture

// This also returns Č. How does it know that it must convert č to Č?
// What if some other language has the character č, and in that language č converts to X?
// How does it determine which character to convert to?
Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
string upperCase1 = "č".ToUpperInvariant();

Take the Turkish example: lowercase “i” becomes “İ” (U+0130 “Latin Capital Letter I With Dot Above”) when converted to uppercase. Similarly, uppercase “I” becomes “ı” (U+0131 “Latin Small Letter Dotless I”) when converted to lowercase.


What if ToUpperInvariant() decides to convert "i" to the Turkish "İ" and not to "I"? Is the invariant culture then English? Outside the scope of this question, but do all languages of the world have an uppercase form for each lowercase character? I assume yes, but if they don't, is there a language that has only uppercase characters? Yes, I know I could go from U+0000 to U+FFFF to test this.

asked Sep 27 '22 13:09 by broadband


1 Answer

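The invariant culture is a culture-insensitive pseudo-culture associated with the English language but with no country or region, so all "Invariant" conversions follow the English ones regardless of the current thread culture: "i".ToUpperInvariant() is always "I", never the Turkish "İ".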
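To see that the "Invariant" conversions ignore the current culture, here is a minimal sketch. The class and helper names (InvariantCasing, UpperIn, LowerIn) are hypothetical, and the Turkish results assume the runtime has real culture data available (that is, it is not running in invariant globalization mode):

```csharp
using System;
using System.Globalization;

public static class InvariantCasing
{
    // Upper-cases with an explicitly named culture instead of the thread's current one.
    // An empty name ("") selects the invariant culture.
    public static string UpperIn(string cultureName, string s) =>
        s.ToUpper(CultureInfo.GetCultureInfo(cultureName));

    // Lower-cases with an explicitly named culture.
    public static string LowerIn(string cultureName, string s) =>
        s.ToLower(CultureInfo.GetCultureInfo(cultureName));

    public static void Main()
    {
        // Make Turkish the current culture to show that the Invariant methods ignore it.
        CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("tr-TR");

        Console.WriteLine("i".ToUpperInvariant()); // I
        Console.WriteLine("I".ToLowerInvariant()); // i

        // Culture-sensitive calls pick up the Turkish rules
        // (expected: İ and ı when Turkish culture data is present).
        Console.WriteLine("i".ToUpper());
        Console.WriteLine(LowerIn("tr-TR", "I"));
    }
}
```

Passing the culture explicitly, as UpperIn and LowerIn do, avoids depending on whatever culture the current thread happens to have.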

Do all languages of the world have upper case for each lower case character?

No, they don't. For example, Chinese languages do not have the concept of upper and lower case.

And German has the letter ß, which traditionally does not have an uppercase version.

Consider:

var germanCulture = new CultureInfo("de-DE");

System.Threading.Thread.CurrentThread.CurrentCulture   = germanCulture;
System.Threading.Thread.CurrentThread.CurrentUICulture = germanCulture;

string s = "ß";

Console.WriteLine(s.ToUpper()); // Prints ß
Console.WriteLine(s.ToLower()); // Prints ß

// Aside: There's a special "uppercase" ß, but this isn't
// returned from "ß".ToUpper();

string t = "ẞ"; // Special "uppercase" ß.

Console.WriteLine(t == s); // Prints false.

Console.WriteLine(s.ToUpper() == t); // Prints false.

(The special uppercase ẞ is U+1E9E, "Latin Capital Letter Sharp S"; it isn't returned from "ß".ToUpper().)
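The asker's idea of walking from U+0000 to U+FFFF to find lowercase characters without an uppercase counterpart can be sketched as follows. This is only an illustration: the names CaseScan and LowercaseWithoutUpper are made up, and the scan uses char.ToUpperInvariant, which maps one char to one char, so multi-character mappings such as ß to "SS" cannot show up:

```csharp
using System;
using System.Collections.Generic;

public static class CaseScan
{
    // Collects the characters in U+0000..U+FFFF that are classified as lowercase
    // letters but map to themselves under char.ToUpperInvariant.
    public static List<char> LowercaseWithoutUpper()
    {
        var result = new List<char>();
        for (int codeUnit = 0; codeUnit <= 0xFFFF; codeUnit++)
        {
            char c = (char)codeUnit;
            if (char.IsLower(c) && char.ToUpperInvariant(c) == c)
                result.Add(c);
        }
        return result;
    }

    public static void Main()
    {
        List<char> unmapped = LowercaseWithoutUpper();
        Console.WriteLine($"Lowercase characters without a single-char uppercase: {unmapped.Count}");
        Console.WriteLine(unmapped.Contains('ß')); // True
        Console.WriteLine(unmapped.Contains('a')); // False
    }
}
```

The exact count depends on the Unicode data shipped with the runtime, so it is not stated here.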

answered Sep 30 '22 07:09 by Matthew Watson