
C# MemoryStream Encoding vs. Encoding.GetChars()

I am trying to copy a byte stream from a database, decode it and finally display it on a web page. However, I am noticing different behavior when decoding the content in different ways (note: I am using the "Western European" encoding, which has a Latin character set and does not support Chinese characters):

var encoding = Encoding.GetEncoding(1252 /*Western European*/);
using (var fileStream = new StreamReader(new MemoryStream(content), encoding))
{
    var str = fileStream.ReadToEnd();
}

Vs.

var encoding = Encoding.GetEncoding(1252 /*Western European*/);
var str = new string(encoding.GetChars(content));

If the content contains Chinese characters, then the first block of code will produce a string like "D$教学而设计的", which is incorrect because the encoding shouldn't support those characters, while the second block will produce a string made up only of Western European characters (each Chinese character comes out as three Latin characters, for example "教" becomes "æ•™"), which is correct as those are all in the Western European character set.

What is the explanation for this difference in behavior?

Sidawy asked Nov 02 '12 13:11



1 Answer

By default, StreamReader looks for a byte order mark (BOM) at the start of the stream and sets its encoding from it, even if you pass a different encoding.

It sees the UTF-8 BOM in your data and correctly uses UTF-8.

To prevent this behavior, pass false as the third parameter:

var fileStream = new StreamReader(new MemoryStream(content), encoding, false);
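To see the difference side by side, here is a minimal sketch. The sample bytes are made up for illustration (a UTF-8 BOM followed by a short UTF-8 string), and the class name BomDetectionDemo is hypothetical. It assumes a modern .NET runtime, where code page 1252 has to be registered through CodePagesEncodingProvider first; on .NET Framework that registration line should not be needed.

```csharp
using System;
using System.IO;
using System.Text;

class BomDetectionDemo
{
    static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is only available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var encoding = Encoding.GetEncoding(1252 /*Western European*/);

        // Hypothetical sample data: UTF-8 BOM (EF BB BF) followed by UTF-8 text.
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] text = Encoding.UTF8.GetBytes("D$教学");
        byte[] content = new byte[bom.Length + text.Length];
        bom.CopyTo(content, 0);
        text.CopyTo(content, bom.Length);

        // Does the data start with the UTF-8 BOM?
        bool hasUtf8Bom = content.Length >= bom.Length
            && content[0] == bom[0] && content[1] == bom[1] && content[2] == bom[2];
        Console.WriteLine("Starts with UTF-8 BOM: " + hasUtf8Bom);

        // Default behavior: BOM detection is on, so the reader may switch
        // away from the encoding that was passed in.
        using (var reader = new StreamReader(new MemoryStream(content), encoding))
        {
            string decoded = reader.ReadToEnd();
            Console.WriteLine("Encoding used after reading: " + reader.CurrentEncoding.WebName);
            Console.WriteLine("Decoded as the original UTF-8 text: " + (decoded == "D$教学"));
        }

        // BOM detection off: the reader keeps the encoding that was passed in,
        // which is the same thing encoding.GetChars(content) does.
        using (var reader = new StreamReader(new MemoryStream(content), encoding, false))
        {
            string viaReader = reader.ReadToEnd();
            string viaGetChars = new string(encoding.GetChars(content));
            Console.WriteLine("Reader matches GetChars: " + (viaReader == viaGetChars));
        }
    }
}
```

With detection turned off, the BOM bytes themselves are decoded as ordinary 1252 characters and appear at the start of the string, so you may want to strip them if the text is going to be displayed.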
SLaks answered Sep 19 '22 07:09
