
StreamWriter and UTF-8 Byte Order Marks

I'm having an issue with StreamWriter and byte order marks (BOMs). The documentation seems to state that the Encoding.UTF8 encoding has byte order marks enabled, but when files are written, some have the marks while others don't.

I'm creating the stream writer in the following way:

this.Writer = new StreamWriter(this.Stream, System.Text.Encoding.UTF8); 

Any ideas on what could be happening would be appreciated.

Kevin asked Mar 10 '11 21:03



2 Answers

As someone already pointed out, calling the constructor without the encoding argument does the trick. However, if you want to be explicit, try this:

using (var sw = new StreamWriter(this.Stream, new UTF8Encoding(false))) 

To disable the BOM, the key is to construct the writer with a new UTF8Encoding(false) instead of just Encoding.UTF8. This is the same as calling the StreamWriter constructor without the encoding argument; internally it does the same thing.

To enable BOM, use new UTF8Encoding(true) instead.
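A minimal sketch to check this yourself, assuming C# 9 or later (top-level statements) and an in-memory stream in place of this.Stream. WriteSample is a hypothetical helper introduced only for this illustration; it writes "hi" with the given encoding and returns the raw bytes:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: writes "hi" through a StreamWriter and returns the bytes produced.
static byte[] WriteSample(Encoding encoding)
{
    var stream = new MemoryStream();
    using (var writer = new StreamWriter(stream, encoding))
    {
        writer.Write("hi");
    } // Disposing the writer flushes it and closes the stream.

    // MemoryStream.ToArray still works after the stream has been closed.
    return stream.ToArray();
}

Console.WriteLine(BitConverter.ToString(WriteSample(new UTF8Encoding(false)))); // 68-69
Console.WriteLine(BitConverter.ToString(WriteSample(new UTF8Encoding(true))));  // EF-BB-BF-68-69
Console.WriteLine(BitConverter.ToString(WriteSample(Encoding.UTF8)));           // EF-BB-BF-68-69
```

The leading EF-BB-BF is the UTF-8 BOM; 68-69 are the bytes for "hi".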

Update: Since Windows 10 version 1903, notepad.exe saves UTF-8 files without a BOM by default; writing the BOM is now an opt-in feature.

HelloSam answered Sep 22 '22 21:09


The issue is due to the fact that you are using the static UTF8 property on the Encoding class.

When the GetPreamble method is called on the Encoding instance returned by the UTF8 property, it returns the byte order mark (a three-byte array: EF BB BF), which is written to the stream before any other content (assuming a new stream).

You can avoid this by creating the instance of the UTF8Encoding class yourself, like so:

// As before.
this.Writer = new StreamWriter(this.Stream,
    // Create the encoding yourself; the parameterless constructor does not emit a BOM.
    new System.Text.UTF8Encoding());

As per the documentation for the default parameterless constructor (emphasis mine):

This constructor creates an instance that does not provide a Unicode byte order mark and does not throw an exception when an invalid encoding is detected.

This means that the call to GetPreamble will return an empty array, and therefore no BOM will be written to the underlying stream.
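The preamble difference can be checked directly. The sketch below again assumes C# 9 or later and a MemoryStream standing in for this.Stream; AppendSample is a hypothetical helper. Its second half illustrates the "assuming a new stream" caveat: as far as I know, StreamWriter skips the preamble when the stream is seekable and already positioned past the start, which is one plausible reason some files end up with a BOM and others without.

```csharp
using System;
using System.IO;
using System.Text;

// Preamble lengths for the encodings discussed above.
Console.WriteLine(Encoding.UTF8.GetPreamble().Length);          // 3
Console.WriteLine(new UTF8Encoding().GetPreamble().Length);     // 0
Console.WriteLine(new UTF8Encoding(true).GetPreamble().Length); // 3

// Hypothetical helper: the stream already holds one byte ('A') before the writer is created.
static byte[] AppendSample()
{
    var stream = new MemoryStream();
    stream.WriteByte(0x41); // Position is now 1, so this is not a "new" stream.
    using (var writer = new StreamWriter(stream, Encoding.UTF8))
    {
        writer.Write("hi");
    }
    return stream.ToArray();
}

// No EF-BB-BF here, even though Encoding.UTF8 has a non-empty preamble.
Console.WriteLine(BitConverter.ToString(AppendSample())); // 41-68-69
```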

casperOne answered Sep 24 '22 21:09