
How to output Byte Order Mark when writing to TextWriter?

i am writing text to a TextWriter. i want the UTF-16 Byte Order Mark (BOM) to appear in the output:

public void ProcessRequest(HttpContext context)
{
   context.Response.ContentEncoding = new UnicodeEncoding(true, true);
   WriteStuffToTextWriter(context.Response.Output);
}

Except the output doesn't contain a byte order mark:

HTTP/1.1 200 OK
Server: ASP.NET Development Server/10.0.0.0
Date: Thu, 06 Sep 2012 21:09:23 GMT
X-AspNet-Version: 4.0.30319
Content-Disposition: attachment; filename="Transactions_Calendar_20120906.csv"
Cache-Control: private
Content-Type: text/csv; filename="Transactions_Calendar_20120906.csv"; charset=utf-16BE
Content-Length: 95022
Connection: Close

JobName,ShiftName,6////09////2012 12::::00::::00 АΜ,...

How do i tell a TextWriter to write the encoding marker?

Note: The 2nd parameter in UnicodeEncoding:

   context.Response.ContentEncoding = new UnicodeEncoding(true, true);

byteOrderMark
Type: System.Boolean
true to specify that a Unicode byte order mark is provided; otherwise, false.
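
A minimal sketch of what that flag does and does not do, assuming only the standard System.Text classes. The flag decides what GetPreamble() returns; GetBytes() never includes the mark. Whether a particular writer ever calls GetPreamble() is up to that writer, and the response shown above suggests the one here does not. The class and method names (PreambleDemo, PreambleHex, BytesHex) are hypothetical, chosen for illustration.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Hex dump of the preamble the encoding offers to writers, e.g. "FE-FF".
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    // Hex dump of the encoded text itself; the preamble is not part of this.
    public static string BytesHex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        var withBom = new UnicodeEncoding(bigEndian: true, byteOrderMark: true);
        var withoutBom = new UnicodeEncoding(bigEndian: true, byteOrderMark: false);

        Console.WriteLine(PreambleHex(withBom));     // FE-FF
        Console.WriteLine(PreambleHex(withoutBom));  // empty line: no preamble offered
        Console.WriteLine(BytesHex(withBom, "A"));   // 00-41: no mark in the encoded text
    }
}
```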

Ian Boyd asked Sep 06 '12 21:09



1 Answer

Short Version

String zwnbsp = "\xfeff"; //Zero-width non-breaking space

//The Zero-width non-breaking space character ***is*** the Byte-Order-Mark (BOM).
String s = zwnbsp+"The quick brown fox jumped over the lazy dog.";
writer.Write(s);

Long Version

At some point i realized how simple the solution is.

i used to think that the Unicode Byte-Order-Mark was some special signature. i used to think i had to carefully decide which byte sequence i wanted to output, in order to output the correct BOM:

  • 0xFE 0xFF (UTF-16, big-endian)
  • 0xFF 0xFE (UTF-16, little-endian)
  • 0xEF 0xBB 0xBF (UTF-8)

But since then i realized that the Byte-Order-Mark is not some special byte sequence that you have to prepend to your file.

The BOM is just a Unicode character. You don't output any bytes; you only output character U+FEFF. When you write that character, the serializer converts it to whatever encoding you're using.

The character U+FEFF (ZERO WIDTH NO-BREAK SPACE) was chosen for good reason. It's a space, so it has no meaning, and it is zero width, so you shouldn't even see it.
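
The three byte sequences in the list above are what you get when you encode the one character U+FEFF in different encodings. A minimal sketch, assuming only the standard System.Text encodings; the class and method names (BomBytesDemo, EncodeBom) are hypothetical, and "\uFEFF" is the same character as the "\xfeff" used elsewhere in this answer.

```csharp
using System;
using System.Text;

public static class BomBytesDemo
{
    // Encodes the single character U+FEFF with the given encoding and
    // returns the resulting bytes as a hex string such as "FE-FF".
    public static string EncodeBom(Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes("\uFEFF");
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        // The BOM/identifier flags are off, so only the encoded character is shown.
        Console.WriteLine(EncodeBom(new UnicodeEncoding(bigEndian: true, byteOrderMark: false)));   // FE-FF
        Console.WriteLine(EncodeBom(new UnicodeEncoding(bigEndian: false, byteOrderMark: false)));  // FF-FE
        Console.WriteLine(EncodeBom(new UTF8Encoding(encoderShouldEmitUTF8Identifier: false)));     // EF-BB-BF
    }
}
```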

That means that my question is fundamentally flawed. There is no such thing as "writing a byte-order-mark". You just make sure the first character you write out is U+FEFF. In my case i am writing to a TextWriter:

void WriteStuffToTextWriter(TextWriter writer)
{
   String csvExport = GetExportAsCSV();

   writer.Write("\xfeff"); //Output Unicode character U+FEFF as a byte order mark
   writer.Write(csvExport);
}

The TextWriter will handle converting the Unicode character U+FEFF into whatever byte encoding it has been configured to use.
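
To check that claim outside ASP.NET, here is a sketch that writes through a StreamWriter over a MemoryStream and inspects the bytes that come out. It stands in for the response writer in the question, not for the actual ASP.NET pipeline; the names TextWriterBomDemo and WriteWithBom are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class TextWriterBomDemo
{
    // Writes U+FEFF followed by the text through a TextWriter wrapping a
    // MemoryStream, and returns the bytes that reached the stream.
    public static byte[] WriteWithBom(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (TextWriter writer = new StreamWriter(stream, encoding))
            {
                writer.Write('\uFEFF'); // first character written is the BOM
                writer.Write(text);
            } // disposing the writer flushes it and closes the stream

            // ToArray still works on a closed MemoryStream.
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        var utf16be = new UnicodeEncoding(bigEndian: true, byteOrderMark: false);
        var utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

        Console.WriteLine(BitConverter.ToString(WriteWithBom("A", utf16be))); // FE-FF-00-41
        Console.WriteLine(BitConverter.ToString(WriteWithBom("A", utf8)));    // EF-BB-BF-41
    }
}
```

One caveat with this stand-in: unlike the response writer in the question, StreamWriter emits the encoding's preamble itself when it starts writing to an empty stream. The sketch therefore builds its encodings with the BOM flags off, so the mark appears once rather than twice.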

Note: Any code is released into the public domain. No attribution required.

Ian Boyd answered Sep 21 '22 18:09