
C#: Different string encoding on attribute vs. constant

I'm writing a test for a function that removes invalid code points such as orphaned surrogates (one half of a surrogate pair without its partner). However, the surrogate is encoded differently depending on how I write the test.

While this version of the test passes:

        [TestCategory("UnitTest")]
        [TestMethod]
        public void RemoveOrphanedSurrogatePair()
        {
            var input = "\uDDDD1975";
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

This one does not:

        [TestCategory("UnitTest")]
        [TestMethod]
        [DataRow("\uDDDD1975")]
        public void RemoveOrphanedSurrogatePair(string input)
        {
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

Looking at the debugger, the first variation encodes the string as "\uDDDD1975", but the second one produces "��1975", which appears to be two valid characters in place of the single orphaned surrogate.
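The function under test is not shown here. For context, this is a minimal sketch of what such an extension method might look like, assuming it simply drops unpaired surrogates and keeps everything else. The class name and the implementation are hypothetical; only the method name `ReplaceInvalidCodePoints` comes from the tests above.

```csharp
using System;
using System.Text;

static class StringCleaningExtensions
{
    // Hypothetical stand-in for the function under test:
    // drops orphaned surrogates and copies all other code units unchanged.
    public static string ReplaceInvalidCodePoints(this string text)
    {
        var result = new StringBuilder(text.Length);
        for (var i = 0; i < text.Length; i++)
        {
            var c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // Valid pair: copy both code units and skip the low half.
                result.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c))
            {
                // Ordinary BMP character (this includes U+FFFD).
                result.Append(c);
            }
            // Otherwise c is an orphaned surrogate and is dropped.
        }
        return result.ToString();
    }
}
```

Under this sketch, "\uDDDD1975" becomes "1975". A string that starts with two U+FFFD characters is returned unchanged, because U+FFFD is a valid code point. That is consistent with the second test failing on the length assertion.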

Assaf Israel asked Dec 31 '20 03:12



1 Answer

I think a clue to the answer can be found in (what else but) a @jonskeet blog post. Apparently C# uses UTF-16 to encode strings everywhere except in attribute constructor arguments, where UTF-8 is used. An orphaned surrogate has no valid UTF-8 encoding, so the compiler seems to treat it via its UTF-8 value as two invalid Unicode characters. Those are then replaced by a pair of \uFFFD characters (the Unicode replacement character, which is used to indicate broken data when decoding binary to text). The following program shows the difference:

using System;
using System.ComponentModel;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "\uDDDD";
 
    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }
 
    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

This will produce:

Attribute: fffd fffd
Constant: dddd
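One possible way around this in the test is to keep the orphaned surrogate out of the attribute metadata altogether: pass the escape sequence as plain ASCII text and decode it at run time. The sketch below assumes `Regex.Unescape` is acceptable for the decoding. The `EscapedTestData` class and `Decode` helper are hypothetical names used for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EscapedTestData
{
    // Hypothetical helper: turns the six literal characters \uDDDD
    // into the single UTF-16 code unit U+DDDD at run time.
    public static string Decode(string escaped) => Regex.Unescape(escaped);

    static void Main()
    {
        // The verbatim string keeps the backslash, so the constant
        // itself contains only ASCII characters.
        const string escaped = @"\uDDDD1975";
        var input = Decode(escaped);

        // Prints: dddd 0031 0039 0037 0035
        Console.WriteLine(string.Join(" ",
            input.Select(c => ((uint) c).ToString("x4"))));
    }
}
```

In the data-driven test this would mean writing `[DataRow(@"\uDDDD1975")]` and calling `Regex.Unescape` on the parameter before using it. Note that `Regex.Unescape` also interprets other backslash escapes, so this only suits test data where that is intended.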
Assaf Israel answered Sep 30 '22 17:09