Can we simplify this string encoding code?

Question

Is it possible to simplify this code into a cleaner/faster form?

StringBuilder builder = new StringBuilder();
var encoding = Encoding.GetEncoding(936);

// convert the text into a byte array
byte[] source = Encoding.Unicode.GetBytes(text);

// convert that byte array to the new codepage. 
byte[] converted = Encoding.Convert(Encoding.Unicode, encoding, source);

// take each byte of the multi-byte characters and store it as a separate char (value 0-255)
foreach (byte b in converted)
    builder.Append((char)b);

// return the result
string result = builder.ToString();

Simply put, it takes a string containing Chinese characters such as 鄆 and converts each one into characters such as ài.

For example, that Chinese character in decimal is 37126 or 0x9106 in hex.

See http://unicodelookup.com/#0x9106/1

Converted to a UTF-16 byte array, we get the bytes 145 and 6 (145 * 256 + 6 = 37126); since Encoding.Unicode is little-endian, the array itself is [6, 145]. When encoded in code page 936 (Simplified Chinese), we get [224, 105]. If we break this byte array down into individual characters, we get 224 = 0xE0 = à and 105 = 0x69 = i in Unicode.

See http://unicodelookup.com/#0x00e0/1 and http://unicodelookup.com/#0x0069/1

Thus, we're doing an encoding conversion and then ensuring that every character in the output string holds a single byte value (0-255), so each source character becomes at most two output characters.
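The arithmetic above can be checked with built-in encodings only. This is a minimal sketch: the class and method names (`EncodingWalkthrough`, `BytesToChars`, `Demo`) are made up for illustration, and the code page 936 bytes [224, 105] are copied from the question rather than computed, so the snippet does not depend on code page 936 being available.

```csharp
using System;
using System.Text;

static class EncodingWalkthrough
{
    // Widens each byte to a char with the same numeric value (0-255),
    // which is what the foreach loop in the question does.
    public static string BytesToChars(byte[] bytes)
    {
        var chars = new char[bytes.Length];
        for (int i = 0; i < bytes.Length; i++)
            chars[i] = (char)bytes[i];
        return new string(chars);
    }

    public static void Demo()
    {
        string text = "\u9106"; // the character 鄆, decimal 37126

        // Encoding.Unicode is UTF-16 little-endian: low byte first.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(text)));          // 06-91
        // Big-endian order matches the 145 * 256 + 6 reading.
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(text))); // 91-06

        // Assumption: these are the code page 936 bytes quoted in the question.
        byte[] cp936Bytes = { 224, 105 };
        string mangled = BytesToChars(cp936Bytes);
        Console.WriteLine((int)mangled[0]); // 224 (U+00E0, à)
        Console.WriteLine((int)mangled[1]); // 105 (U+0069, i)
    }
}
```

Calling `EncodingWalkthrough.Demo()` prints the two byte orders and the two code points; printing the mangled string itself may look different depending on the console's own encoding.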

Update: I need this final representation because it is the format my receipt printer accepts. Took me forever to figure it out! :) Since I'm not an encoding expert, I'm looking for simpler or faster code, but the output must remain the same.

Update (Cleaner version):

return Encoding.GetEncoding("ISO-8859-1").GetString(Encoding.GetEncoding(936).GetBytes(text));

Eamon Nerbonne · Accepted Answer

Well, for one, you don't need to convert the "built-in" string representation to a byte array before calling Encoding.Convert.

You could just do:

byte[] converted = Encoding.GetEncoding(936).GetBytes(text);

To then reconstruct a string from that byte array whereby the char values directly map to the bytes, you could do...

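// Requires: using System.Linq; and using System.Text;
static string MangleTextForReceiptPrinter(string text) {
    // Encode to code page 936, then widen each byte to a char with the same value.
    return new string(
        Encoding.GetEncoding(936)
            .GetBytes(text)
            .Select(b => (char) b)
            .ToArray());
}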

I wouldn't worry too much about efficiency; how many MB/sec are you going to print on a receipt printer anyhow?

Joe pointed out that there's an encoding that directly maps byte values 0-255 to the code points with the same values: age-old Latin1 (ISO-8859-1). That allows us to shorten the function to...

return Encoding.GetEncoding("Latin1").GetString(
           Encoding.GetEncoding(936).GetBytes(text)
       );
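One way to convince yourself that the Latin1 version produces the same output as the `(char) b` cast is to compare the two over all 256 byte values. The sketch below does that and also checks that encoding the string back to Latin1 returns the original bytes. The names `Latin1Check`, `MatchesCharCast` and `RoundTrips` are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

static class Latin1Check
{
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Every possible byte value, 0 through 255.
    static byte[] AllBytes()
    {
        return Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();
    }

    // True if decoding as Latin1 gives the same string as casting each byte to char.
    public static bool MatchesCharCast()
    {
        byte[] all = AllBytes();
        string viaLatin1 = Latin1.GetString(all);
        string viaCast = new string(all.Select(b => (char)b).ToArray());
        return viaLatin1 == viaCast;
    }

    // True if the string can be turned back into the exact original bytes.
    public static bool RoundTrips()
    {
        byte[] all = AllBytes();
        return Latin1.GetBytes(Latin1.GetString(all)).SequenceEqual(all);
    }
}
```

Because the mapping is reversible, the mangled string can always be turned back into the code page 936 bytes with `GetBytes` on the same Latin1 encoding.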

By the way, if this is a buggy Windows-only API (which it is, by the looks of it), you might be dealing with code page 1252 instead (which is almost identical to Latin1, differing only in the 0x80-0x9F range). You might try Reflector to see what it's doing with your System.String before it sends it over the wire.

Jon Skeet · Answer

Almost anything would be cleaner than this - you're really abusing text here, IMO. You're trying to represent effectively opaque binary data (the encoded text) as text data... so you'll potentially get things like bell characters, escapes etc.

The normal way of encoding opaque binary data in text is base64, so you could use:

return Convert.ToBase64String(Encoding.GetEncoding(936).GetBytes(text));

The resulting text will be entirely ASCII, which is much less likely to cause you hassle.
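As a small illustration of the base64 route, here is a sketch that encodes and decodes the two bytes quoted in the question, [224, 105]. The wrapper class `Base64Demo` is hypothetical; `Convert.ToBase64String` and `Convert.FromBase64String` are the standard .NET calls. Note that this only helps if whatever consumes the text decodes base64 again.

```csharp
using System;

static class Base64Demo
{
    // Binary data to ASCII-only text.
    public static string Encode(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // ASCII-only text back to the original bytes.
    public static byte[] Decode(string text)
    {
        return Convert.FromBase64String(text);
    }
}
```

For example, `Base64Demo.Encode(new byte[] { 224, 105 })` returns "4Gk=", and `Base64Demo.Decode("4Gk=")` returns the original two bytes.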

EDIT: If you need that output, I would strongly recommend that you represent it as a byte array instead of as a string... pass it around as a byte array from that point onwards, so you're not tempted to perform string operations on it.
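A rough sketch of what passing the data around as bytes could look like. Everything here is an assumption for illustration: the names `ReceiptBytes`, `Encode` and `Send` are made up, and whether the printer API can take a byte array or a stream at all is not something the question tells us. For the printer in the question, the encoding passed in would be `Encoding.GetEncoding(936)`.

```csharp
using System;
using System.IO;
using System.Text;

static class ReceiptBytes
{
    // Encodes the text once and keeps the result as bytes, never as a string.
    public static byte[] Encode(string text, Encoding printerEncoding)
    {
        return printerEncoding.GetBytes(text);
    }

    // Writes the already-encoded bytes to the output unchanged.
    public static void Send(Stream output, byte[] encoded)
    {
        output.Write(encoded, 0, encoded.Length);
        output.Flush();
    }
}
```

With this shape, the only place a string exists is before `Encode`, so there is no mangled intermediate string to accidentally trim, compare or re-encode.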

Tags: c#, optimization, character-encoding

Jason Kealey
