I'm having a problem comparing strings in a unit test in C# 4.0 using Visual Studio 2010. This same test case works properly in Visual Studio 2008 (with .NET 3.5).
Here's the relevant code snippet:
    byte[] rawData = GetData();
    string data = Encoding.UTF8.GetString(rawData);
    Assert.AreEqual("Constant", data, false, CultureInfo.InvariantCulture);
While debugging this test, the data string appears to the naked eye to contain exactly the same text as the literal. When I called data.ToCharArray(), I noticed that the first character of data has the value 65279 (U+FEFF), which is the Unicode byte order mark. What I don't understand is why Encoding.UTF8.GetString() keeps this character around.
How do I get Encoding.UTF8.GetString() to not put the byte order mark in the resulting string?
Update: The problem was that GetData(), which reads a file from disk, reads the data from the file using FileStream.readbytes(). I corrected this by using a StreamReader and converting the string to bytes using Encoding.UTF8.GetBytes(), which is what it should've been doing in the first place! Thanks for all the help.
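For illustration, here is a minimal sketch of what a corrected GetData() along those lines might look like. The path parameter, the class name FileDataDemo, and the temp-file round trip in Main are assumptions made for this example; the original method is not shown in the question.

```csharp
using System;
using System.IO;
using System.Text;

static class FileDataDemo
{
    // Hypothetical stand-in for the GetData() in the question; the path
    // parameter is an assumption made for this sketch.
    public static byte[] GetData(string path)
    {
        // This StreamReader constructor detects a byte order mark by default
        // and leaves it out of the text it returns.
        using (StreamReader reader = new StreamReader(path, Encoding.UTF8))
        {
            string text = reader.ReadToEnd();
            // GetBytes encodes only the characters; it adds no preamble.
            return Encoding.UTF8.GetBytes(text);
        }
    }

    static void Main()
    {
        // Write a small UTF-8 file that starts with a BOM, then read it back.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "Constant", new UTF8Encoding(true));
        byte[] rawData = GetData(path);
        File.Delete(path);

        string data = Encoding.UTF8.GetString(rawData);
        Console.WriteLine(data == "Constant");
    }
}
```

With the BOM consumed by the reader, the decoded string should compare equal to the literal in the test.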
If you want to remove the byte order mark from a source file, you need a text editor that lets you choose whether the mark is saved. Open the file that has the BOM in the editor, then save it again without the BOM. The mark should then no longer appear.
UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
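To connect the byte sequence with the value 65279 seen in the question, here is a small sketch that prints the UTF-8 preamble, the UTF-8 encoding of U+FEFF, and the code point as a decimal number. The class name BomBytesDemo and the helper ToHex are made up for this example.

```csharp
using System;
using System.Text;

static class BomBytesDemo
{
    // Formats bytes as space-separated hex pairs.
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    static void Main()
    {
        // The preamble (signature) that Encoding.UTF8 advertises.
        Console.WriteLine(ToHex(Encoding.UTF8.GetPreamble()));
        // U+FEFF itself, encoded as UTF-8.
        Console.WriteLine(ToHex(Encoding.UTF8.GetBytes("\uFEFF")));
        // The same character as a decimal code point.
        Console.WriteLine((int)'\uFEFF');
    }
}
```

The first two lines of output should be identical, which is why decoding a file that starts with the signature yields a leading U+FEFF character.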
1. From The Unicode Standard 5.0: The Unicode Standard also specifies the use of an initial byte order mark (BOM) to explicitly differentiate big-endian or little endian data in some of the Unicode encoding schemes.
Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding, if you don't want it, but you should consider whether the byte array should contain the BOM to start with.
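Removing the BOM after decoding could look roughly like the sketch below. StripLeadingBom and BomStripDemo are hypothetical names for this example, not framework APIs. The check compares the first char directly rather than calling a culture-sensitive StartsWith, since culture-sensitive comparisons may treat U+FEFF as an ignorable character.

```csharp
using System;
using System.Text;

static class BomStripDemo
{
    // Hypothetical helper: drops one leading U+FEFF, if present.
    public static string StripLeadingBom(string text)
    {
        if (text.Length > 0 && text[0] == '\uFEFF')
        {
            return text.Substring(1);
        }
        return text;
    }

    static void Main()
    {
        // UTF-8 signature followed by the letter 'A'.
        byte[] rawData = { 0xEF, 0xBB, 0xBF, 0x41 };
        string decoded = Encoding.UTF8.GetString(rawData);

        Console.WriteLine(decoded.Length);                  // BOM character plus 'A'
        Console.WriteLine(StripLeadingBom(decoded).Length); // just 'A'
    }
}
```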
EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example, showing the same byte array being converted into two characters using Encoding.GetString or one character via a StreamReader:
    using System;
    using System.IO;
    using System.Text;

    class Test
    {
        static void Main()
        {
            // UTF-8 BOM (EF BB BF) followed by the letter 'A'.
            byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };

            // GetString decodes the BOM as a character: length is 2.
            string viaEncoding = Encoding.UTF8.GetString(withBom);
            Console.WriteLine(viaEncoding.Length);

            // StreamReader detects and skips the BOM: length is 1.
            string viaStreamReader;
            using (StreamReader reader = new StreamReader
                   (new MemoryStream(withBom), Encoding.UTF8))
            {
                viaStreamReader = reader.ReadToEnd();
            }
            Console.WriteLine(viaStreamReader.Length);
        }
    }
There is a slightly more efficient way to do it than creating StreamReader and MemoryStream:
1) If you know that there is always a BOM:
    string viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);
2) If you don't know, check:
    string viaEncoding;
    if (withBom.Length >= 3 && withBom[0] == 0xEF && withBom[1] == 0xBB && withBom[2] == 0xBF)
        viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);
    else
        viaEncoding = Encoding.UTF8.GetString(withBom);