
How do I convert Unicode escape sequences to Unicode characters in a .NET string?

Tags:

c#

.net

unicode

Say you've loaded a text file into a string, and you'd like to convert all Unicode escapes into actual Unicode characters inside of the string.

Example:

"The following is the top half of an integral character in Unicode '\u2320', and this is the lower half '\U2321'."

asked Oct 08 '08 17:10 by jr.



1 Answer

The answer is simple and works well with strings up to at least several thousand characters.

Example 1:

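// Requires: using System.Globalization; using System.Text.RegularExpressions;
// 'result' is the string that already holds the loaded text.
Regex rx = new Regex( @"\\[uU]([0-9A-Fa-f]{4})" );
result = rx.Replace( result, match => ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString() );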

Example 2:

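// Requires: using System.Globalization; using System.Text.RegularExpressions;
// 'result' is the string that already holds the loaded text.
Regex rx = new Regex( @"\\[uU]([0-9A-Fa-f]{4})" );
result = rx.Replace( result, delegate (Match match) { return ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString(); } );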

The first example makes the replacement with a lambda expression (C# 3.0); the second uses an anonymous delegate, which should work with C# 2.0.
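For anyone who wants to try the approach outside of a larger program, here is a minimal self-contained sketch of the first example. The class name `UnicodeEscapes` and the method name `Unescape` are made up for illustration, and the pattern here accepts hex digits in either case.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeEscapes
{
    // Matches \u or \U followed by exactly four hex digits (either case).
    private static readonly Regex EscapePattern =
        new Regex(@"\\[uU]([0-9A-Fa-f]{4})");

    // Hypothetical helper: replaces each escape with the character it encodes.
    public static string Unescape(string input)
    {
        return EscapePattern.Replace(input, match =>
            ((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString());
    }

    public static void Main()
    {
        // Verbatim string, so the backslashes are literal characters,
        // just as they would be in text loaded from a file.
        string text = @"Top half '\u2320', lower half '\U2321'.";
        Console.WriteLine(Unescape(text));
    }
}
```

Text that contains no escapes comes back unchanged, so calling the helper on arbitrary input should be harmless.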

To break down what's going on here, first we create a regular expression:

new Regex( @"\\[uU]([0-9A-Fa-f]{4})" );
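To see what this pattern does and does not pick up, a quick check like the one below may help. The class name `PatternCheck` and the sample inputs are made up, and the pattern is written to accept hex digits in either case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // A literal backslash, then 'u' or 'U', then exactly four hex digits.
    public static readonly Regex Rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");

    public static void Main()
    {
        Console.WriteLine(Rx.IsMatch(@"\u2320")); // True
        Console.WriteLine(Rx.IsMatch(@"\U2321")); // True
        Console.WriteLine(Rx.IsMatch(@"\u232"));  // False: only three hex digits
        Console.WriteLine(Rx.IsMatch(@"\x2320")); // False: 'x' is not 'u' or 'U'
    }
}
```

Note that the pattern always takes four digits, so an eight-digit escape in the C# `\U` style would not be decoded as a single character by this approach.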

Then we call Replace() with the string 'result' and an anonymous function (a lambda expression in the first example, a delegate in the second; it could also be a regular named method) that converts each match of the regular expression found in the string.

The Unicode escape is processed like this:

((char) Int32.Parse(match.Value.Substring(2), NumberStyles.HexNumber)).ToString()

Get the string representing the number part of the escape (skip the first two characters).

match.Value.Substring(2) 

Parse that string using Int32.Parse(), which takes the string and the number format that Parse() should expect. In this case the format is a hex number.

NumberStyles.HexNumber 

Then we cast the resulting number to a Unicode character:

(char) 

And finally we call ToString() on the Unicode character. This gives us its string representation, which is the value passed back to Replace():

.ToString() 
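The steps above can be traced on a single escape, one intermediate value at a time. This is only an illustrative sketch; the class name `DecodeOneEscape`, the method name `Decode`, and the variable names are invented here.

```csharp
using System;
using System.Globalization;

public static class DecodeOneEscape
{
    // Hypothetical helper: takes matched text such as \u2320 (six characters).
    public static string Decode(string escape)
    {
        string hexDigits = escape.Substring(2);                        // "2320"
        int codeUnit = Int32.Parse(hexDigits, NumberStyles.HexNumber); // 0x2320 == 8992
        char character = (char) codeUnit;                              // U+2320
        return character.ToString();                                   // one-character string
    }

    public static void Main()
    {
        string replacement = Decode(@"\u2320");
        Console.WriteLine(replacement.Length);   // prints 1
        Console.WriteLine((int) replacement[0]); // prints 8992
    }
}
```

Because the result is a single `char`, this handles only values that fit in one UTF-16 code unit, which is all a four-digit escape can express anyway.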

Note: Instead of grabbing the text to be converted with a Substring call, you could use the match parameter's GroupCollection and the parenthesized subexpression in the regular expression, which captures just the number ('2320'), but that's more complicated and less readable.
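For comparison, a sketch of the group-based variant mentioned in the note might look like this. The class name `GroupBasedUnescape` and the method name `Unescape` are made up, and the pattern accepts hex digits in either case.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class GroupBasedUnescape
{
    public static string Unescape(string input)
    {
        Regex rx = new Regex(@"\\[uU]([0-9A-Fa-f]{4})");
        // Groups[0] is the whole match; Groups[1] is the first parenthesized
        // subexpression, i.e. just the four hex digits.
        return rx.Replace(input, match =>
            ((char) Int32.Parse(match.Groups[1].Value, NumberStyles.HexNumber)).ToString());
    }

    public static void Main()
    {
        Console.WriteLine(Unescape(@"\u0041\u0042\u0043")); // prints ABC
    }
}
```

Whether this reads better than the Substring version is a matter of taste; its one practical advantage is that it keeps working if the prefix part of the pattern ever changes length.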

answered Sep 19 '22 20:09 by jr.