I've written a program (in C#) that reads and manipulates MSIL programs that have been generated from C# programs. I had mistakenly assumed that the syntax rules for MSIL string constants are the same as for C#, but then I ran into the following situation:
This C# statement
string s = "Do you wish to send anyway?";
gets compiled into (among other MSIL statements) this
IL_0128: ldstr "Do you wish to send anyway\?"
I wasn't expecting the backslash that is used to escape the question mark. Now I can obviously take this backslash into account as part of my processing, but mostly out of curiosity I'd like to know if there is a list somewhere of which characters get escaped when the C# compiler converts C# constant strings to MSIL constant strings.
Thanks.
String literals may contain any valid characters, including escape sequences such as \n, \t, etc. Octal and hexadecimal escape sequences are technically legal in string literals, but they are less commonly used there than in character constants, and they can accidentally run on into the text that follows.
In C#, the backslash \ is the escape character. Put a backslash before a double quote or another backslash to include it in a string, and use sequences such as \n, \r, and \t for special characters.
Update
Based on experimentation using the C# compiler + ildasm.exe: perhaps the reason there is no list of escaped characters is because there are so few: precisely 6.
Going from the IL generated by ildasm, from C# programs compiled by Visual Studio 2010:
\t : 0x09 : (tab)
\n : 0x0A : (newline)
\r : 0x0D : (carriage return)
\" : 0x22 : (double quote)
\? : 0x3F : (question mark)
\\ : 0x5C : (backslash)
Example 1: ASCII above 0x7E: A simple accented é (U+00E9)
C#: Either "é" or "\u00E9" becomes (E9 byte comes first)
ldstr bytearray (E9 00 )
Example 2: UTF-16: Summation symbol ∑ (U+2211)
C#: Either "∑" or "\u2211" becomes (11 byte comes first)
ldstr bytearray (11 22 )
Example 3: UTF-32: Double-struck mathematical 𝔸 (U+1D538)
C#: Either "𝔸" or UTF-16 surrogate pair "\uD835\uDD38" becomes (bytes within char reversed, but double-byte chars in overall order)
ldstr bytearray (35 D8 38 DD )
Example 4: The byte array conversion applies to the entire string when it contains a non-ASCII character
C#: "In the last decade, the German word \"über\" has come to be used frequently in colloquial English."
becomes
ldstr bytearray (49 00 6E 00 20 00 74 00 68 00 65 00 20 00 6C 00
61 00 73 00 74 00 20 00 64 00 65 00 63 00 61 00
64 00 65 00 2C 00 20 00 74 00 68 00 65 00 20 00
47 00 65 00 72 00 6D 00 61 00 6E 00 20 00 77 00
6F 00 72 00 64 00 20 00 22 00 FC 00 62 00 65 00
72 00 22 00 20 00 68 00 61 00 73 00 20 00 63 00
6F 00 6D 00 65 00 20 00 74 00 6F 00 20 00 62 00
65 00 20 00 75 00 73 00 65 00 64 00 20 00 66 00
72 00 65 00 71 00 75 00 65 00 6E 00 74 00 6C 00
79 00 20 00 69 00 6E 00 20 00 63 00 6F 00 6C 00
6C 00 6F 00 71 00 75 00 69 00 61 00 6C 00 20 00
45 00 6E 00 67 00 6C 00 69 00 73 00 68 00 2E 00 )
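For the processing program, the bytearray form can be turned back into a string. The following is only a minimal sketch: it assumes the bytes are UTF-16 little-endian (which is what the four examples above suggest) and that any trailing comments ildasm may add to each line have already been stripped. The class and method names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

static class IlByteArray
{
    // Decodes the hex payload of an ildasm "bytearray ( ... )" operand.
    // Assumption: the payload is UTF-16 little-endian and contains only
    // hex byte pairs separated by white space (no trailing comments).
    public static string Decode(string operand)
    {
        int open = operand.IndexOf('(');
        int close = operand.LastIndexOf(')');
        if (open < 0 || close < open)
            throw new FormatException("Expected 'bytearray ( ... )'.");

        byte[] bytes = operand
            .Substring(open + 1, close - open - 1)
            // A null separator array splits on any white space, including line breaks.
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(hex => Convert.ToByte(hex, 16))
            .ToArray();

        // Encoding.Unicode is UTF-16 little-endian.
        return Encoding.Unicode.GetString(bytes);
    }
}
```

Called as IlByteArray.Decode("bytearray (11 22 )"), this should give back the single character ∑ from Example 2.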
Directly, "you can't" (find a list of MSIL string escapes), but here are some helpful tidbits...
ECMA-335, which contains the strict definition of CIL, does not specify which characters must be escaped in QSTRING literals, only that they may be escaped using the backslash \ character. The most important notes are:
Numeric character escapes are written in octal, not as Unicode escapes (e.g. \042, not \u0022).
A string literal can be continued across lines using the \ character--see below
The only explicitly mentioned escapes are tab \t, linefeed \n, and octal numeric escapes. This is a bit annoying for your purposes since C# does not have an octal literal -- you'll have to do your own extraction and conversion, such as by using the Convert.ToInt32([string], 8) method.
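As a rough illustration of that extraction and conversion, here is a sketch of an unescaper for the text between the quotes of a QSTRING. It handles \t, \n, \r and octal escapes, and treats a backslash before any other character (such as \" \? \\) as that character itself. Two things are assumptions on my part rather than anything the spec text quoted here states: that an octal escape has at most three digits, and that unknown escapes can be passed through unchanged. The names are hypothetical.

```csharp
using System;
using System.Text;

static class IlStringEscapes
{
    // Unescapes the body of an IL QSTRING (the text between the quotes).
    public static string Unescape(string body)
    {
        var sb = new StringBuilder(body.Length);
        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            // Ordinary character, or a lone backslash at the very end: copy as-is.
            if (c != '\\' || i + 1 >= body.Length) { sb.Append(c); continue; }

            char next = body[++i];
            if (next >= '0' && next <= '7')
            {
                // Octal escape. Assumption: at most three octal digits.
                int start = i;
                while (i + 1 < body.Length && i - start < 2
                       && body[i + 1] >= '0' && body[i + 1] <= '7')
                    i++;
                string digits = body.Substring(start, i - start + 1);
                sb.Append((char)Convert.ToInt32(digits, 8));
                continue;
            }

            switch (next)
            {
                case 't': sb.Append('\t'); break;
                case 'n': sb.Append('\n'); break;
                case 'r': sb.Append('\r'); break;
                // \" \? \\ and anything unrecognized: keep the escaped character.
                default: sb.Append(next); break;
            }
        }
        return sb.ToString();
    }
}
```

With this, the body from the question, Do you wish to send anyway\?, comes back without the backslash, and \042 comes back as a double quote.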
Beyond that the choice of escapes is "implementation-specific" to the "hypothetical IL assembler" described in the spec. So your question rightly asks about the rules for MSIL, which is Microsoft's strict implementation of CIL. As far as I can tell, MS has not documented their choice of escapes. It could be helpful at least to ask the Mono folks what they use. Beyond that, it may be a matter of generating the list yourself -- make a program that declares a string literal for every character \u0000 - whatever, and see what the compiled ldstr statements are. If I get to it first, I'll be sure to post my results.
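One way to set up that experiment is to generate the test program's source rather than type it. This is just a sketch with made-up names: it emits one single-character literal per code unit, written in \uXXXX form so nothing in the generated source needs escaping, and passes each literal to Console.Write so the compiler has a reason to emit an ldstr for it. You would still have to compile the output and run ildasm on it yourself. Starting with a small range such as 0x00 to 0x7F is probably wise, since I have not checked how the compiler and ildasm treat every code unit (lone surrogates in particular).

```csharp
using System;
using System.Text;

static class EscapeProbe
{
    // Builds C# source containing one single-character string literal
    // for each UTF-16 code unit in [first, last].
    public static string BuildSource(int first, int last)
    {
        var sb = new StringBuilder();
        sb.AppendLine("static class Probe");
        sb.AppendLine("{");
        sb.AppendLine("    static void Main()");
        sb.AppendLine("    {");
        for (int c = first; c <= last; c++)
        {
            // Emits a line of the form: System.Console.Write("\u003F");
            sb.AppendLine($"        System.Console.Write(\"\\u{c:X4}\");");
        }
        sb.AppendLine("    }");
        sb.AppendLine("}");
        return sb.ToString();
    }
}
```

Writing the result of EscapeProbe.BuildSource(0x00, 0x7F) to a .cs file, compiling it, and disassembling it should give one ldstr per character to compare against.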
Additional notes:
To properly parse *IL string literals -- known as QSTRINGS or SQSTRINGS -- you will have to account for more than just character escapes. Take in-code string concatenation, for example (and this is verbatim from Partition II::5.2):
The "+" operator can be used to concatenate string literals. This way, a long string can be broken across multiple lines by using "+" and a new string on each line. An alternative is to use "\" as the last character in a line, in which case, that character and the line break following it are not entered into the generated string. Any white space characters (space, line-feed, carriage-return, and tab) between the "\" and the first non-white space character on the next line are ignored. [Note: To include a double quote character in a QSTRING, use an octal escape sequence. end note]
Example: The following result in strings that are equivalent to "Hello World from CIL!":
ldstr "Hello " + "World " + "from CIL!"
ldstr "Hello World\
\040from CIL!"
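To give an idea of what handling this looks like in C#, here is a sketch that takes a whole ldstr operand, removes "\" line continuations, and glues the quoted pieces of a "+" concatenation together. It returns the still-escaped body, so character escapes need a separate pass. The names are invented, and it makes simplifying assumptions: the operand contains only quoted pieces and "+" signs, and an escaped backslash never sits directly before a line break.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class IlStringJoin
{
    // A backslash at the end of a line, the line break, and any
    // white space leading up to the first character on the next line.
    static readonly Regex Continuation = new Regex(@"\\\r?\n[ \t\r\n]*");

    // One quoted piece: an opening quote, then escaped pairs or
    // non-quote characters, then the closing quote.
    static readonly Regex Piece =
        new Regex("\"((?:\\\\.|[^\"\\\\])*)\"", RegexOptions.Singleline);

    // Returns the concatenated, still-escaped string body of an ldstr operand.
    public static string JoinPieces(string operand)
    {
        string joined = Continuation.Replace(operand, "");
        var sb = new StringBuilder();
        foreach (Match m in Piece.Matches(joined))
            sb.Append(m.Groups[1].Value);
        return sb.ToString();
    }
}
```

For the first spec example this yields Hello World from CIL! directly; for the second it yields Hello World\040from CIL!, which becomes the same string once the octal escape is decoded.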