
Encoding.UTF8.GetString doesn't take into account the Preamble/BOM

In .NET, I'm trying to use the Encoding.UTF8.GetString method, which takes a byte array and converts it to a string.

It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.

I know I can use a TextReader to digest the BOM as needed, but I thought that the GetString method should be some kind of a macro that makes our code shorter.

Am I missing something? Is this behavior intentional?

Here's code that reproduces it:

    static void Main(string[] args)
    {
        string s1 = "abc";
        byte[] abcWithBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithBom)); // ef, bb, bf, 61, 62, 63
        }

        byte[] abcWithoutBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithoutBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithoutBom)); // 61, 62, 63
        }

        var restore1 = Encoding.UTF8.GetString(abcWithoutBom);
        Console.WriteLine(restore1.Length); // 3
        Console.WriteLine(restore1); // abc

        var restore2 = Encoding.UTF8.GetString(abcWithBom);
        Console.WriteLine(restore2.Length); // 4 (!)
        Console.WriteLine(restore2); // ?abc
    }

    private static string FormatArray(byte[] bytes1)
    {
        return string.Join(", ", from b in bytes1 select b.ToString("x"));
    }
Ron Klein asked Jul 28 '12 13:07



1 Answer

It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.

It doesn't look like it "ignores" it at all - it faithfully converts it to the BOM character. That's what it is, after all.
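To make that concrete, here is a minimal sketch that decodes the same six bytes from the question and inspects the first character. The class name BomCharDemo and the helper FirstCodePoint are made up for illustration; only Encoding.UTF8.GetString comes from the question.

```csharp
using System;
using System.Text;

public static class BomCharDemo
{
    // Hypothetical helper: formats the first UTF-16 code unit of a string as U+XXXX.
    public static string FirstCodePoint(string s)
    {
        return "U+" + ((int)s[0]).ToString("X4");
    }

    public static void Main()
    {
        // The UTF-8 preamble (EF BB BF) followed by "abc", as in the question.
        byte[] abcWithBom = { 0xEF, 0xBB, 0xBF, 0x61, 0x62, 0x63 };

        string decoded = Encoding.UTF8.GetString(abcWithBom);

        // The three preamble bytes decode to one character, U+FEFF.
        Console.WriteLine(decoded.Length);          // 4
        Console.WriteLine(FirstCodePoint(decoded)); // U+FEFF
        Console.WriteLine(decoded[0] == '\uFEFF');  // True
    }
}
```

The "?" in the question's console output is most likely just the console failing to render U+FEFF, not a decoding error.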

If you want to make your code ignore the BOM in any string it converts, that's up to you to do... or use StreamReader.
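Both options could look roughly like the sketch below. The class BomStripping and its two methods are hypothetical names, not part of the framework; the first skips the preamble by hand before calling GetString, and the second hands the bytes to a StreamReader, which consumes a leading BOM on its own.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomStripping
{
    // Hypothetical helper: decode UTF-8 bytes, skipping a leading preamble if present.
    public static string DecodeUtf8SkippingBom(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        int offset = 0;

        if (bytes.Length >= preamble.Length)
        {
            bool hasPreamble = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (bytes[i] != preamble[i]) { hasPreamble = false; break; }
            }
            if (hasPreamble) offset = preamble.Length;
        }

        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    // Hypothetical helper: let StreamReader detect and swallow the BOM.
    public static string DecodeWithStreamReader(byte[] bytes)
    {
        using (var ms = new MemoryStream(bytes))
        using (var reader = new StreamReader(ms, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] abcWithBom = { 0xEF, 0xBB, 0xBF, 0x61, 0x62, 0x63 };

        Console.WriteLine(DecodeUtf8SkippingBom(abcWithBom).Length);  // 3
        Console.WriteLine(DecodeWithStreamReader(abcWithBom).Length); // 3
    }
}
```

The manual version only strips a BOM at the very start of the array, which is the only place a preamble is meaningful; a U+FEFF in the middle of the data is left alone.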

Note that if you either use Encoding.GetBytes followed by Encoding.GetString, or use StreamWriter followed by StreamReader, each pairing is self-consistent: Encoding.GetBytes never emits a BOM, so GetString has nothing to trip over, and StreamReader swallows any BOM that StreamWriter produced. It's only when you mix a StreamWriter (which uses Encoding.GetPreamble) with a direct Encoding.GetString call that you end up with the "extra" character.
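A small sketch of the three combinations, assuming a UTF8Encoding constructed with the BOM flag set as in the question. The class RoundTrips and its helper methods are illustrative names only.

```csharp
using System;
using System.IO;
using System.Text;

public static class RoundTrips
{
    // Hypothetical helper: write text through a StreamWriter and return the raw bytes.
    public static byte[] WriteWithStreamWriter(string text, Encoding encoding)
    {
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, encoding))
        {
            sw.Write(text);
            sw.Flush();
            return ms.ToArray();
        }
    }

    // Hypothetical helper: read all text from raw bytes through a StreamReader.
    public static string ReadWithStreamReader(byte[] bytes, Encoding encoding)
    {
        using (var ms = new MemoryStream(bytes))
        using (var sr = new StreamReader(ms, encoding))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        var utf8WithBom = new UTF8Encoding(true);

        // Pair 1: GetBytes then GetString. No preamble is ever produced.
        byte[] viaGetBytes = utf8WithBom.GetBytes("abc");
        Console.WriteLine(viaGetBytes.Length);                        // 3
        Console.WriteLine(utf8WithBom.GetString(viaGetBytes).Length); // 3

        // Pair 2: StreamWriter then StreamReader. Preamble written, then swallowed.
        byte[] viaWriter = WriteWithStreamWriter("abc", utf8WithBom);
        Console.WriteLine(viaWriter.Length);                                  // 6
        Console.WriteLine(ReadWithStreamReader(viaWriter, utf8WithBom).Length); // 3

        // Mixed: StreamWriter then GetString. The preamble survives as U+FEFF.
        Console.WriteLine(utf8WithBom.GetString(viaWriter).Length); // 4
    }
}
```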

Jon Skeet answered Oct 19 '22 10:10
