Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# "anyString".Contains('\0', StringComparison.InvariantCulture) returns true in .NET5 but false in older versions

Tags:

c#

.net-5

c#-9.0

I encountered an incompatible problem while I was trying to upgrade my projects from .NET core 3.1 to the latest .NET 5.

My original code has a validation logic to check invalid file name characters by checking each character returned from Path.GetInvalidFileNameChars() API.


var invalidFilenameChars = Path.GetInvalidFileNameChars();
bool validFileName = !invalidFilenameChars.Any(ch => fileName.Contains(ch, StringComparison.InvariantCulture));

Suppose you give a regular value to fileName such as "test.txt" that should be valid. Surprisingly, however, the above code gives the file name is invalid if you run it with 'net5' target framework.

After spend some time on debugging, what I found is that the returned invalid character set contains '\0', null ASCII character and "text.txt".Contains("\0, StringComparison.InvariantCulture) gives true.

    class Program
    {
        static void Main(string[] args)
        {
            var containsNullChar = "test".Contains("\0", StringComparison.InvariantCulture);
    
            Console.WriteLine($"Contains null char {containsNullChar}");
        }
    }

If you run in .NET core 3.1, it never says regular string contains null character. Also, if I omit the second parameter (StringComparison.InvariantCulture) or if I use StringComparison.Ordinal, the strange result is never returned.

Why this behavior is changed in .NET5?

EDIT: As commented by Karl-Johan Sjögren before, there is indeed a behavior change in .NET5 regarding string comparison:

Behavior changes when comparing strings on .NET 5+

Also see the related ticket:

string.IndexOf get different result in .Net 5

Though this issue should be related to above, the current result related to '\0' still looks strange to me and might still be considered to be a bug as answered by @xanatos.

EDIT2:

Now I realized that the actual cause of this problem was my confusion between InvariantCulture and Ordinal string comparison. They are actually quite different things. See the ticket below:

Difference between InvariantCulture and Ordinal string comparison

Also note that this should be unique problem of .NET as other major programming languages such as Java, C++ and Python treat ordinal comparison by default.

like image 764
Ryo Asai Avatar asked Jan 05 '21 07:01

Ryo Asai


Video Answer


1 Answers

not a bug, a feature

The issue that I've opened has been closed, but they gave a very good explanation. Now... In .NET 5.0 they began using on Windows (on Linux it was already present) a new library for comparing strings, the ICU library. It is the official library of the Unicode Consortium, so it is "the verb". That library is used for CurrentCulture, InvariantCulture (plus the respective IgnoreCase) and and any other culture. The only exception is the Ordinal/OrdinalIgnoreCase. The library is targetted for text and it has some "particular" ideas about non-text. In this particular case, there are some characters that are simply ignored. In the block 0000-00FF I would say the ignored characters are all control codes (please ignore the fact that they are shown as €‚ƒ„†‡ˆ‰Š‹ŒŽ‘’“”•–—™š›œžŸ, at a certain point these characters have been remapped somewhere else in the Unicode, but the glyps shown don't reflect it, but if you try to see their code, like doing char ch = '€'; int val = (int)ch; you'll see it), and '\0' is a control code.

Now... My personal thinking is that to compare string from today you'll need a master's degree in Unicode Technologies 😥, and I do hope that they'll do some shenanigans in .NET 6.0 to make the default comparison Ordinal (it is one of the proposals for .NET 6.0, the Option B). Note that if you want to make programs that can run in Turkey you already needed a master's degree in Unicode Technologies (see the Turkish i problem).

In general I would say that to look for words that aren't keywords/fixed words (for example column names), you should use Culture-aware comparisons, while to look for keywords/fixed words (for example column names) and symbols/control codes you should use Ordinal comparisons. The problem is when you want to look for both at the same time. Normally in this case you are looking for exact words, so you can use Ordinal. Otherwise it becames hellish. And I don't even want to think how Regex works internally in a Culture-aware environment. That I don't want to think about. Becasue in that direction there can only be folly and nightmares 😁.

As a sidenote, even before the "default" Culture-aware comparisons had some secret shaeaningans... for example:

int ix = "ʹ$ʹ".IndexOf("$"); // -1 on .NET Framework or .NET Core <= 3.1

what I had written before

I'll say that it is a bug. There is a similar bug with IndexOf. I've opened an Issue on github to track it.

As you have written, the Ordinal and OrdinalIgnoreCase work as expected (probably because they don't need to use the new ICU library for handling Unicode).

Some sample code:

Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.InvariantCultureIgnoreCase)}");

Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0t", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCultureIgnoreCase)}");

and

Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0test", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0test", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0test", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.InvariantCultureIgnoreCase)}");

Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0t", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.OrdinalIgnoreCase)}");

Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0test", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.CurrentCultureIgnoreCase)}");

Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0test", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.InvariantCultureIgnoreCase)}");
like image 92
xanatos Avatar answered Oct 11 '22 21:10

xanatos