
Unsafe string manipulation mutates non-existent value

Tags: string, c#, .net

string in C# is a reference type that behaves like a value type. Usually programmers do not have to worry about this, since strings are immutable and the language design prevents us from doing unintentionally dangerous things with them. However, with unsafe pointer logic it is possible to directly manipulate the underlying value of a string, like so:

    using System;

    class Program
    {
        static string foo = "FOO";
        static string bar = "FOO";
        const string constFoo = "FOO";

        static unsafe void Main(string[] args)
        {
            fixed (char* p = foo)
            {
                for (int i = 0; i < foo.Length; i++)
                    p[i] = 'M';
            }
            Console.WriteLine($"foo = {foo}"); // MMM
            Console.WriteLine($"bar = {bar}"); // MMM
            Console.WriteLine($"constFoo = {constFoo}"); // FOO
        }
    }

When this runs, the strings are interned as an optimization, so both foo and bar point to the same underlying value. By manipulating foo this way we also change the value of bar. The const value is inlined by the compiler and is not affected by this. Nothing strange thus far.
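The sharing described above can be checked without any unsafe code. The following is a minimal sketch, not part of the original program: the class name `InternCheck` and its helper methods are hypothetical, and it assumes all the literals live in the same assembly.

```csharp
using System;

static class InternCheck
{
    static string foo = "FOO";
    static string bar = "FOO";
    const string constFoo = "FOO";

    // True when both static fields reference one and the same string object.
    public static bool FieldsShareInstance() => ReferenceEquals(foo, bar);

    // True when the inlined constant resolves to that same object.
    public static bool ConstSharesInstance() => ReferenceEquals(foo, constFoo);

    // A string built at run time has equal contents but is a separate object.
    public static bool RuntimeCopyIsSeparate()
    {
        string copy = new string(new[] { 'F', 'O', 'O' });
        return copy == foo && !ReferenceEquals(copy, foo);
    }

    static void Main()
    {
        Console.WriteLine(FieldsShareInstance());
        Console.WriteLine(ConstSharesInstance());
        Console.WriteLine(RuntimeCopyIsSeparate());
    }
}
```

If all three checks report true, then foo, bar and the inlined constant are one object, which is why a write through a pointer to any of them is visible through the others.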

If we change the fixed variable from foo to constFoo, we start seeing some strange behaviour.

    using System;

    class Program
    {
        static string foo = "FOO";
        static string bar = "FOO";
        const string constFoo = "FOO";

        static unsafe void Main(string[] args)
        {
            fixed (char* p = constFoo)
            {
                for (int i = 0; i < constFoo.Length; i++)
                    p[i] = 'M';
            }
            Console.WriteLine($"foo = {foo}"); // MMM
            Console.WriteLine($"bar = {bar}"); // MMM
            Console.WriteLine($"constFoo = {constFoo}"); // FOO
        }
    }

Even though it is constFoo that we fixed and manipulated, it is the values of foo and bar that are mutated. Why are foo and bar being mutated?

It gets even stranger if we now change the values of foo and bar.

    using System;

    class Program
    {
        static string foo = "BAR";
        static string bar = "BAR";
        const string constFoo = "FOO";

        static unsafe void Main(string[] args)
        {
            fixed (char* p = constFoo)
            {
                for (int i = 0; i < constFoo.Length; i++)
                    p[i] = 'M';
            }
            Console.WriteLine($"foo = {foo}"); // BAR
            Console.WriteLine($"bar = {bar}"); // BAR
            Console.WriteLine($"constFoo = {constFoo}"); // FOO
        }
    }

The code runs and we appear to mutate something somewhere, but there is no change to our variables. What are we mutating in this code?

Okami asked Aug 29 '19 08:08


2 Answers

You are modifying the string in the interned string table, as the following code demonstrates:

    using System;

    namespace CoreApp1
    {
        class Program
        {
            const string constFoo = "FOO";

            static unsafe void Main(string[] args)
            {
                fixed (char* p = constFoo)
                {
                    for (int i = 0; i < constFoo.Length; i++)
                        p[i] = 'M';
                }

                // Madness ensues: The next line prints "MMM":
                Console.WriteLine("FOO"); // Prints the interned value of "FOO" which is now "MMM"
            }
        }
    }
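For contrast, here is a minimal sketch of changing the characters of a private copy instead, which leaves the interned instance alone. The class name `CopyThenMutate` and the helper `Overwrite` are hypothetical names used only for illustration.

```csharp
using System;

static class CopyThenMutate
{
    const string constFoo = "FOO";

    // Returns a new string of the same length filled with `fill`.
    // Only the private char[] copy is written to, never the source string.
    public static string Overwrite(string source, char fill)
    {
        char[] buffer = source.ToCharArray(); // independent copy of the characters
        for (int i = 0; i < buffer.Length; i++)
            buffer[i] = fill;
        return new string(buffer);
    }

    static void Main()
    {
        string mutated = Overwrite(constFoo, 'M');
        Console.WriteLine(mutated); // the modified copy
        Console.WriteLine("FOO");   // the literal, which was never written to
    }
}
```

Because the write goes to a separate array, every other use of the "FOO" literal in the program keeps its original contents.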

Here's something a little harder to explain:

    using System;
    using System.Runtime.InteropServices;

    namespace CoreApp1
    {
        class Program
        {
            const string constFoo = "FOO";

            static void Main()
            {
                char[] chars = new StringToChar { str = constFoo }.chr;

                for (int i = 0; i < constFoo.Length; i++)
                {
                    chars[i] = 'M';
                    Console.WriteLine(chars[i]); // Always prints "M".
                }

                Console.WriteLine("FOO"); // x86: Prints "MMM". x64: Prints "FOM".
            }
        }

        [StructLayout(LayoutKind.Explicit)]
        public struct StringToChar
        {
            [FieldOffset(0)] public string str;
            [FieldOffset(0)] public char[] chr;
        }
    }

This doesn't use any unsafe code, but it still mutates the string in the intern table.

What's harder to explain here is that for x86 the interned string is changed to "MMM" as you'd expect, but for x64 it gets changed to "FOM". What happened to the changes to the first two characters? I can't explain this, but I'm guessing it's to do with fitting two characters into a word for x64 rather than just one.

Matthew Watson answered Nov 07 '22 21:11


To help you understand this, you can decompile the assembly and inspect the IL code.

Taking your second snippet, you will get something like this:

    // static fields initialization
    .method specialname static void .cctor () cil managed
    {
        IL_0000: ldstr "FOO"
        IL_0005: stsfld string Program::foo

        IL_000a: ldstr "FOO"
        IL_000f: stsfld string Program::bar
    }

    .method static void Main() cil managed
    {
        .entrypoint
        .locals init (
            [0] char* p,
            [1] string pinned,
            // ...
        )

        // fixed (char* ptr = "FOO")
        IL_0001: ldstr "FOO"
        IL_0006: stloc.1
        IL_0007: ldloc.1
        IL_0008: conv.u
        IL_0009: stloc.0
        // ...
    }

Note that in all three cases, the string is loaded onto the evaluation stack using the ldstr opcode.

From the documentation:

The Common Language Infrastructure (CLI) guarantees that the result of two ldstr instructions referring to two metadata tokens that have the same sequence of characters return precisely the same string object (a process known as "string interning").

So in all three cases, you get the same string object - the interned string instance. This explains the "mutated" const object.
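The ldstr guarantee quoted above can be observed from safe code. This is a minimal sketch rather than anything from the original post: the class `LdstrCheck` and its methods are hypothetical, and it assumes both literals are compiled into the same assembly.

```csharp
using System;

static class LdstrCheck
{
    // Each method body contains its own ldstr for the literal "FOO".
    public static string FirstSite() => "FOO";
    public static string SecondSite() => "FOO";

    // True when two independent ldstr sites yield one string object.
    public static bool SitesShareInstance() =>
        ReferenceEquals(FirstSite(), SecondSite());

    // True when string.Intern maps a run-time string to the pooled instance.
    public static bool InternReturnsPooledInstance()
    {
        string literal = FirstSite();                       // pooled literal
        string built = new string(new[] { 'F', 'O', 'O' }); // separate object
        return !ReferenceEquals(built, literal)
            && ReferenceEquals(string.Intern(built), literal);
    }

    static void Main()
    {
        Console.WriteLine(SitesShareInstance());
        Console.WriteLine(InternReturnsPooledInstance());
    }
}
```

Both checks compare references rather than contents, so a true result means the literal sites hand out the very same object that the fixed statement pinned and overwrote.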

dymanoid answered Nov 07 '22 21:11