
How to convert a string with character codes above 127 to a byte array properly?

I am retrieving ASCII strings encoded with code page 437 from another system which I need to transform to Unicode so they can be mixed with other Unicode strings.

This is what I am working with:

var asciiString = "\u0094"; // 0x94 represents 'ö' in code page 437.

var asciiEncoding = Encoding.GetEncoding(437);
var unicodeEncoding = Encoding.Unicode;

// This is what I attempted, but it does not seem to support the eighth bit: characters that use it are replaced with '?' (0x3F).
var asciiBytes = asciiEncoding.GetBytes(asciiString);

// This work-around does the job, but there must be built in functionality to do this?
//var asciiBytes = asciiString.Select(c => (byte)c).ToArray();

// This piece of code happily converts the character correctly to Unicode: { 0x94 } => { 0xF6, 0x0 }.
var unicodeBytes = Encoding.Convert(asciiEncoding, unicodeEncoding, asciiBytes);
var unicodeString = unicodeEncoding.GetString(unicodeBytes); // I want this to be 'ö'.

What I am struggling with is that I cannot find a suitable method in the .NET Framework to transform a string with character codes above 127 to a byte array. This seems strange, since there is support for transforming a byte array with values above 127 into Unicode strings.

So my question is: is there a built-in method to do this conversion properly, or is my work-around the proper way to do it?

Oskar Sjöberg asked Dec 26 '22 19:12


2 Answers

var asciiString = "\u0094";

Whatever you name it, this will always be a Unicode string. .NET only has Unicode strings.

I am retrieving ASCII strings encoded with code page 437 from another system

Treat the incoming data as byte[], not as string.

var asciiBytes = new byte[] { 0x94 }; // 0x94 represents 'ö' in code page 437.

var asciiEncoding = Encoding.GetEncoding(437);    

var unicodeString = asciiEncoding.GetString(asciiBytes);
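As a minimal, self-contained sketch of this approach (not part of the original answer): the class and helper names `Cp437Decoding`, `GetCp437`, `Decode` and `RecoverBytes` are hypothetical. It assumes .NET Core or .NET 5+, where legacy code pages such as 437 have to be registered through `CodePagesEncodingProvider` before `Encoding.GetEncoding(437)` succeeds; on .NET Framework that registration line can most likely be dropped. `RecoverBytes` covers the case where the data has already arrived as a string whose chars hold the raw byte values: Latin-1 (code page 28591) maps U+0000 through U+00FF onto the same byte values, so it should behave like the `(byte)c` work-around from the question.

```csharp
using System;
using System.Text;

public static class Cp437Decoding
{
    // Returns the code page 437 encoding.
    // Assumption: running on .NET Core / .NET 5+, where legacy code pages
    // must be registered before they can be looked up.
    public static Encoding GetCp437()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(437);
    }

    // Decodes raw code page 437 bytes into a regular .NET (UTF-16) string.
    public static string Decode(byte[] raw)
    {
        return GetCp437().GetString(raw);
    }

    // Turns a string whose chars are really raw byte values (0..255)
    // back into those bytes, using Latin-1 (code page 28591).
    public static byte[] RecoverBytes(string misdecoded)
    {
        return Encoding.GetEncoding(28591).GetBytes(misdecoded);
    }

    public static void Main()
    {
        // Preferred path: the incoming data is kept as bytes.
        string fromBytes = Decode(new byte[] { 0x94 });
        Console.WriteLine(((int)fromBytes[0]).ToString("X4")); // 00F6

        // Repair path: the data was already stored in a string as "\u0094".
        string repaired = Decode(RecoverBytes("\u0094"));
        Console.WriteLine(((int)repaired[0]).ToString("X4")); // 00F6
    }
}
```

The code point is printed in hex rather than as the character itself, because what the console displays for 'ö' depends on its own output encoding.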
Henk Holterman answered Dec 31 '22 10:12


\u0094 is Unicode code-point 0094, which is a control character; it is not ö. If you wanted ö, the correct string is

string s = "ö";

which is LATIN SMALL LETTER O WITH DIAERESIS, aka code-point 00F6.

So:

var s = "\u00F6"; // Identical to "ö"

Now we get our encoding:

var enc = Encoding.GetEncoding(437);
var bytes = enc.GetBytes(s);

And we find that it is a single byte with decimal value 148, which is hex 94, i.e. what you were after.

The significance here is that in C# when you use the "\uXXXX" syntax, the XXXX is always referring to Unicode code-points, not the encoded value in some particular encoding.
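To make the difference between the two escapes visible, here is a small sketch (the names `Cp437RoundTrip`, `GetStrictCp437`, `Encode` and `CanEncode` are made up for illustration). It assumes .NET Core or .NET 5+, hence the `CodePagesEncodingProvider` registration, and it swaps the default fallback for `EncoderFallback.ExceptionFallback` so that a character with no code page 437 equivalent throws instead of silently turning into '?'.

```csharp
using System;
using System.Text;

public static class Cp437RoundTrip
{
    // Code page 437 with strict fallbacks: unmappable input throws
    // instead of being replaced.
    // Assumption: .NET Core / .NET 5+, so the provider must be registered.
    public static Encoding GetStrictCp437()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(
            437,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // Encodes a .NET string into code page 437 bytes.
    public static byte[] Encode(string text)
    {
        return GetStrictCp437().GetBytes(text);
    }

    // Reports whether every character of the text exists in code page 437.
    public static bool CanEncode(string text)
    {
        try
        {
            Encode(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // U+00F6 ('ö') is in code page 437 and becomes the single byte 0x94.
        Console.WriteLine(BitConverter.ToString(Encode("\u00F6"))); // 94

        // U+0094 is a C1 control character, which code page 437 does not contain.
        Console.WriteLine(CanEncode("\u0094"));
    }
}
```

With the default fallback, the second case is what produces the '?' (0x3F) described in the question.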

Marc Gravell answered Dec 31 '22 12:12