Replace Unicode escape sequences in a string [duplicate]

Tags: c#, .net

We have a text file that contains the following text:

"\u5b89\u5fbd\u5b5f\u5143"

When we read the file content in C# (.NET), it is displayed as:

"\\u5b89\\u5fbd\\u5b5f\\u5143"

Our decoder method is:

public string Decoder(string value)
{
    Encoding enc = new UTF8Encoding();
    byte[] bytes = enc.GetBytes(value);
    return enc.GetString(bytes);
}

When I pass a hard-coded value,

string Output=Decoder("\u5b89\u5fbd\u5b5f\u5143");

it works, but when we pass a variable holding the file content, it does not.

This is how we use the string read from the text file:

  value=(text file content)
  string Output=Decoder(value);

It returns the wrong output.

How can I fix this?

PrateekSaluja asked Mar 16 '12 13:03


3 Answers

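Use the code below. Regex.Unescape (in System.Text.RegularExpressions) converts any escaped characters in the input string, including \uXXXX sequences, and returns the result as a new string:

string decoded = Regex.Unescape(value);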
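As a minimal sketch of this approach, the call can be wrapped as shown below. The names `UnescapeExample` and `Decode` are hypothetical and not from the question. Note that Regex.Unescape also interprets other regex escapes such as \n and \t, and throws an ArgumentException on sequences it does not recognize, so this assumes the file holds only well-formed escapes and ordinary text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, for illustration only.
public static class UnescapeExample
{
    // Converts literal escape sequences such as \u5b89 (backslash, 'u',
    // four hex digits) into the characters they denote.
    public static string Decode(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        return Regex.Unescape(value);
    }
}
```

Calling `UnescapeExample.Decode(@"\u5b89\u5fbd\u5b5f\u5143")` should yield the same four-character string as the hard-coded literal in the question.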
Sagar answered Oct 22 '22 12:10


You could use a regular expression to parse the file:

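// Requires: using System.Globalization; using System.Text.RegularExpressions;
// Matches a literal \u followed by exactly four hexadecimal digits.
private static readonly Regex _regex = new Regex(@"\\u(?<Value>[0-9a-fA-F]{4})", RegexOptions.Compiled);

public string Decoder(string value)
{
    // Replace each escape with the UTF-16 code unit it denotes.
    return _regex.Replace(
        value,
        m => ((char)int.Parse(m.Groups["Value"].Value, NumberStyles.HexNumber)).ToString()
    );
}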

And then:

string data = Decoder(File.ReadAllText("test.txt"));
Darin Dimitrov answered Oct 22 '22 12:10


So your file contains the verbatim string

\u5b89\u5fbd\u5b5f\u5143

in ASCII and not the string represented by those four Unicode codepoints in some given encoding?
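To make that distinction concrete, here is a small sketch; the class and field names are hypothetical. In a regular C# string literal the compiler decodes each \uXXXX escape, whereas text read from a file arrives as the raw characters, which a verbatim literal reproduces here.

```csharp
using System;

// Hypothetical demo class, for illustration only.
public static class EscapeDemo
{
    // Regular literal: the compiler turns each escape into one character,
    // so this string holds four characters.
    public static readonly string Compiled = "\u5b89\u5fbd\u5b5f\u5143";

    // Verbatim literal: no escape processing, so this string holds the
    // 24 raw characters (backslash, 'u', four hex digits, four times),
    // which is what reading the file produces.
    public static readonly string FromFile = @"\u5b89\u5fbd\u5b5f\u5143";
}
```

The doubled backslashes shown in the question ("\\u5b89...") are most likely just how the tool escapes a single literal backslash when displaying the second kind of string.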

As it happens, I just wrote some code in C# that can parse strings in this format for a JSON parser project -- here's a variant that only handles \uXXXX escapes:

private static string ReadSlashedString(TextReader reader) {
    var sb = new StringBuilder(32);
    bool q = false;
    while (true) {
        int chrR = reader.Read();

        if (chrR == -1) break;
        var chr = (char) chrR;

        if (!q) {
            if (chr == '\\') {
                q = true;
                continue;
            }
            sb.Append(chr);
        }
        else {
            switch (chr) {
                case 'u':
                case 'U':
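                    var hexb = new char[4];
                    // ReadBlock keeps reading until four characters arrive or the input ends;
                    // a plain Read may legally return fewer.
                    if (reader.ReadBlock(hexb, 0, 4) != 4)
                        throw new FormatException("Truncated \\u escape: expected four hex digits.");
                    chr = (char) Convert.ToInt32(new string(hexb), 16);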
                    sb.Append(chr);
                    break;
                default:
                    throw new Exception("Invalid backslash escape (\\ + charcode " + (int) chr + ")");
            }
            q = false;
        }
    }
    return sb.ToString();
}

And you could use it like:

var str = ReadSlashedString(new StringReader("\\u5b89\\u5fbd\\u5b5f\\u5143"));

(or using a StreamReader to read from a file).

Darin Dimitrov's regexp-utilizing answer is probably faster, but I happened to have this code at hand. :)

AKX answered Oct 22 '22 13:10