
Regex and Capital I in some cultures

What is wrong with capital 'I' in some cultures? I found that in some cultures it can't be matched under specific conditions: when you search for [a-z] with the RegexOptions.IgnoreCase flag. Here is sample code:

    using System;
    using System.Globalization;
    using System.Text.RegularExpressions;
    using System.Threading;

    var allCultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
    var allLetters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    var allLettersCount = allLetters.Length;

    foreach (var culture in allCultures)
    {
        // Run the same match under every available culture.
        Thread.CurrentThread.CurrentCulture = culture;
        Thread.CurrentThread.CurrentUICulture = culture;

        var matched = string.Empty;
        foreach (var m in Regex.Matches(allLetters, "[A-Za-z0-9]", RegexOptions.IgnoreCase))
            matched += m;

        // Report cultures in which at least one character failed to match.
        var count = matched.Length;
        if (count != allLettersCount)
            Console.WriteLine("Culture '{0}' - {1} missing; Matched: {2}", culture.Name, (allLettersCount - count).ToString(), matched);
    }

The output is (notice the missing capital I in every line):

    Culture 'az' - 1 missing; Matched:          ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
    Culture 'az-Cyrl' - 1 missing; Matched:     ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
    Culture 'az-Cyrl-AZ' - 1 missing; Matched:  ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
    Culture 'az-Latn' - 1 missing; Matched:     ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
    Culture 'az-Latn-AZ' - 1 missing; Matched:  ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
    Culture 'tr' - 1 missing; Matched:          ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
    Culture 'tr-TR' - 1 missing; Matched:       ABCDEFGHJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789

Interestingly, if the IgnoreCase flag is not used, it works fine and finds "I".

MiroJanosik asked Apr 16 '15 12:04

1 Answer

The answer is in Wikipedia:

The casing of the dotless and dotted I forms differ from other languages. That implies that a case insensitive matching expected by an English person doesn't match the expectations of a Turkish user. The "Turkish I" is often used as an example of the problems with case insensitivity in computing.
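To make the quoted point concrete, here is a minimal sketch of how culture-sensitive casing treats the letter I. The class name TurkishCasingDemo and the helper CodePoint are hypothetical names chosen for illustration. The Turkish results assume the runtime has full culture data available (not invariant globalization mode). A plausible reading of the question's output, at least for the .NET Framework of that time, is that IgnoreCase compares characters after lowercasing them with the current culture, so 'I' becomes the dotless 'ı' (U+0131), which falls outside a-z.

```csharp
using System;
using System.Globalization;

public static class TurkishCasingDemo
{
    // Hypothetical helper: formats the first character of s as "U+XXXX".
    public static string CodePoint(string s) => "U+" + ((int)s[0]).ToString("X4");

    public static void Main()
    {
        var turkish = new CultureInfo("tr-TR");
        var invariant = CultureInfo.InvariantCulture;

        // Invariant rules: 'I' lowercases to the ordinary dotted 'i' (U+0069).
        Console.WriteLine(CodePoint("I".ToLower(invariant)));

        // Turkish rules: 'I' lowercases to dotless 'ı'
        // (U+0131 is expected when Turkish culture data is available).
        Console.WriteLine(CodePoint("I".ToLower(turkish)));

        // Turkish rules: 'i' uppercases to dotted 'İ'
        // (U+0130 is expected when Turkish culture data is available).
        Console.WriteLine(CodePoint("i".ToUpper(turkish)));
    }
}
```

The same asymmetry applies to the Azerbaijani cultures listed in the question, which share the dotted and dotless I.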

And another explanation can be found on MSDN:

[Image: screenshot of the MSDN explanation]
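One common way to avoid the problem, sketched here rather than taken from the answer itself, is to combine RegexOptions.IgnoreCase with RegexOptions.CultureInvariant so that case-insensitive comparison uses invariant casing rules instead of the current culture's. The class name CultureInvariantRegexDemo and the method CountMatches are hypothetical names for this example. The pattern is simplified to [a-z0-9], since IgnoreCase already covers the uppercase range.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CultureInvariantRegexDemo
{
    // Hypothetical helper: counts the characters of input that match
    // [a-z0-9] case-insensitively under invariant casing rules.
    public static int CountMatches(string input) =>
        Regex.Matches(input, "[a-z0-9]",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant).Count;

    public static void Main()
    {
        // Simulate a user whose current culture is Turkish.
        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");

        const string allLetters =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

        // With CultureInvariant, every character is matched, including 'I'.
        Console.WriteLine(CountMatches(allLetters) == allLetters.Length); // True
    }
}
```

Whether to use CultureInvariant depends on intent: it suits identifiers, protocol tokens, and other machine-oriented text, while text meant to follow a Turkish user's own casing expectations may deliberately want the culture-sensitive behavior.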

Wiktor Stribiżew answered Oct 05 '22 04:10
