Why does a space preceding a non-combining diacritic function differently when using IndexOf(string) and IndexOf(char)?

I am taking a substring of a string that contains non-combining diacritics following a space. I first check the string with .Contains() and then call .Substring(). When I pass the space as a char to .IndexOf(), the program behaves as expected, but when I pass the string " ", .IndexOf() returns -1 and .Substring() throws an exception. As the samples below show, an ArgumentOutOfRangeException is thrown only when the space immediately precedes the primary stress mark (U+02C8).

Simple code (Edit suggested by John):

string a = "aɪ prɪˈzɛnt";
string b = "maɪ ˈprɛznt";

// A            
Console.WriteLine(a.IndexOf(" ")); // string index:  2
Console.WriteLine(a.IndexOf(' ')); // char index:    2

// B    
Console.WriteLine(b.IndexOf(" ")); // string index: -1
Console.WriteLine(b.IndexOf(' ')); // char index:    3

Sample code I tested with:

        const string iPresent = "aɪ prɪˈzɛnt",
                     myPresent = "maɪ ˈprɛznt";

        if(iPresent.Contains(' '))
        {
            Console.WriteLine(iPresent.Substring(0, iPresent.IndexOf(' ')));
        }

        if(iPresent.Contains(" "[0]))
        {
            Console.WriteLine(iPresent.Substring(0, iPresent.IndexOf(" "[0])));
        }

        if(iPresent.Contains(" "))
        {
            Console.WriteLine(iPresent.Substring(0, iPresent.IndexOf(" ")));
        }

        if(iPresent.Contains(string.Empty + ' '))
        {
            Console.WriteLine(iPresent.Substring(0, iPresent.IndexOf(string.Empty + ' ')));
        }

        if (myPresent.Contains(' '))
        {
            Console.WriteLine(myPresent.Substring(0, myPresent.IndexOf(' ')));
        }

        if (myPresent.Contains(" "[0]))
        {
            Console.WriteLine(myPresent.Substring(0, myPresent.IndexOf(" "[0])));
        }

        if (myPresent.Contains(string.Empty + ' '))
        {
            try
            {
                Console.WriteLine(myPresent.Substring(0, myPresent.IndexOf(string.Empty + ' ')));
            }
            catch (Exception ex)
            {
                Console.WriteLine("***" + ex.Message);
            }
        }

        if (myPresent.Contains(" "))
        {
            try
            {
                Console.WriteLine(myPresent.Substring(0, myPresent.IndexOf(" ")));
            }
            catch (Exception ex)
            {
                Console.WriteLine("***" + ex.Message);
            }
        }
Tristong asked Jun 30 '20 23:06


1 Answer

IndexOf(string) does something different from IndexOf(char), because IndexOf(char)...

...performs an ordinal (culture-insensitive) search, where a character is considered equivalent to another character only if their Unicode scalar values are the same.

whereas IndexOf(string)...

performs a word (case-sensitive and culture-sensitive) search using the current culture.

So it's a whole lot "smarter" than IndexOf(char), because it takes the string comparison rules of the current culture into account. Under those rules the space is apparently not matched on its own when it is immediately followed by U+02C8, which is why the search returns -1.
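This also explains why the Contains() guard in the question does not help: Contains(string) is documented as an ordinal comparison, so it can report a match that the culture-sensitive IndexOf(string) does not find, and Substring(0, -1) then throws. A minimal sketch of that mismatch follows; the class name is made up for illustration, and the culture-sensitive result depends on the runtime and platform, so no output is claimed for it.

```csharp
using System;

class ContainsVersusIndexOf
{
    static void Main()
    {
        string b = "maɪ ˈprɛznt";

        // Contains(string) performs an ordinal comparison,
        // so it sees the space regardless of culture rules.
        Console.WriteLine(b.Contains(" "));

        // IndexOf(string) with no StringComparison uses the current culture.
        // The result varies by runtime and platform (the question reports -1).
        int cultureIndex = b.IndexOf(" ");
        Console.WriteLine(cultureIndex);

        // Checking the index itself, rather than Contains, avoids Substring(0, -1).
        Console.WriteLine(cultureIndex >= 0 ? b.Substring(0, cultureIndex) : "(no match)");
    }
}
```

Guarding on the returned index keeps the check and the search under the same comparison rules.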

After some testing in other languages and on other platforms, I suspect this is a bug in .NET Framework, because in .NET Core 3.1 b.IndexOf(" ") doesn't return -1, and neither does b.IndexOf(' ', StringComparison.CurrentCulture). Other languages and platforms where a culture-sensitive search finds the space in "maɪ ˈprɛznt" include:

  • Mono 6
  • Swift 5

Passing in StringComparison.Ordinal works:

b.IndexOf(" ", StringComparison.Ordinal)

But do note that you lose the smartness of culture-sensitive comparison.
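If the goal is just to take the text before the first space, as in the question, the ordinal search can be wrapped in a small helper that also handles the no-match case. This is only a sketch: PhoneticText and FirstWord are hypothetical names, and returning the whole string when no space is found is an assumption about the desired behavior.

```csharp
using System;

static class PhoneticText
{
    // Hypothetical helper: returns the text before the first space,
    // or the whole string when it contains no space.
    public static string FirstWord(string text)
    {
        // Ordinal search compares UTF-16 code units directly,
        // so the result does not depend on the current culture.
        int index = text.IndexOf(" ", StringComparison.Ordinal);
        return index < 0 ? text : text.Substring(0, index);
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(PhoneticText.FirstWord("aɪ prɪˈzɛnt"));
        Console.WriteLine(PhoneticText.FirstWord("maɪ ˈprɛznt"));
    }
}
```

With the strings from the question, FirstWord returns "aɪ" and "maɪ" respectively, without relying on a separate Contains() check.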

Sweeper answered Oct 22 '22 00:10