C#: Different string encoding on attribute vs. constant

I'm writing a test for a function that is meant to remove invalid code points such as orphaned surrogates. However, the orphaned surrogate is encoded differently depending on how I write the test.

While this version of the test passes:

        [TestCategory("UnitTest")]
        [TestMethod]
        public void RemoveOrphanedSurrogatePair()
        {
            var input = "\uDDDD1975";
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

This one does not:

        [TestCategory("UnitTest")]
        [TestMethod]
        [DataRow("\uDDDD1975")]
        public void RemoveOrphanedSurrogatePair(string input)
        {
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

In the debugger, the first variation holds the string "\uDDDD1975", but the second one receives "\uFFFD\uFFFD1975" (displayed as "��1975"), which is two valid characters instead of one orphaned surrogate.

asked Dec 31 '20 03:12

Assaf Israel

1 Answer

I think a clue to the answer can be found in (what else but) a @jonskeet blog post. Apparently C# uses UTF-16 to encode strings everywhere except in attribute constructor arguments, where UTF-8 is used. The compiler seems to recognize that this is an orphaned surrogate and, going through its UTF-8 form, treats it as two invalid Unicode characters. Those are then replaced by a pair of \uFFFD characters (the Unicode replacement character, which is used to indicate broken data when decoding binary to text). For example:

using System;
using System.ComponentModel;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "\uDDDD";
 
    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }
 
    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

This produces:

Attribute: fffd fffd
Constant: dddd
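
One possible workaround, sketched here under the assumption that the test only needs a way to get the orphaned surrogate past the attribute: keep the attribute argument pure ASCII and expand the escape at run time. `EscapedInput`, `Expand` and `ToHex` are hypothetical names made up for this sketch; `Regex.Unescape` is the standard .NET method and also expands other regex escapes, so it only suits inputs without stray backslashes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the original question or of MSTest.
public static class EscapedInput
{
    // Expands escapes such as \uDDDD at run time, so the text stored in the
    // attribute only has to contain ASCII characters.
    public static string Expand(string escaped) => Regex.Unescape(escaped);

    // Formats each UTF-16 code unit as four hex digits, like DumpString above.
    public static string ToHex(string text) =>
        string.Join(" ", text.Select(c => ((int)c).ToString("x4")));
}

public static class Program
{
    public static void Main()
    {
        // A verbatim literal keeps the backslash: ten ASCII characters, no surrogate yet.
        const string escaped = @"\uDDDD1975";

        string input = EscapedInput.Expand(escaped);

        // Prints: dddd 0031 0039 0037 0035
        Console.WriteLine(EscapedInput.ToHex(input));
    }
}
```

In the failing test this would mean writing `[DataRow(@"\uDDDD1975")]` and calling `EscapedInput.Expand(input)` as the first line of the test method. MSTest's `[DynamicData]` attribute, which takes its rows from a static property or method instead of attribute arguments, may be another way to avoid the problem.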

answered Sep 30 '22 17:09

Assaf Israel
