
How to remove BOM from byte array

I have XML data in byte[] byteArray which may or may not contain a BOM. Is there a standard way in C# to remove the BOM from it? If not, what is the best way to do so that handles all cases, including all types of encoding?

I am fixing a bug and don't want to change much of the existing code, so it would be best if someone could give me code that just removes the BOM.

I know that I could search for byte 60, the ASCII value of '<', and ignore every byte before it, but I don't want to do that.

Ravi Gupta asked Mar 18 '13 11:03


3 Answers

All of the C# XML parsers will automatically handle the BOM for you. I'd recommend using XDocument - in my opinion it provides the cleanest abstraction of XML data.

Using XDocument as an example:

using (var stream = new MemoryStream(bytes))
{
  var document = XDocument.Load(stream);
  ...
}

Once you have an XDocument you can then use it to emit the bytes without the BOM:

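// writer.Settings is read-only once the writer exists, so set the encoding up front.
var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
byte[] bytesWithoutBOM;
using (var stream = new MemoryStream())
{
  using (var writer = XmlWriter.Create(stream, settings))
  {
    document.WriteTo(writer);
  } // disposing the writer flushes its buffer into the stream
  bytesWithoutBOM = stream.ToArray();
}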
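
Putting the two snippets together, an end-to-end version might look like the sketch below. The names BomFreeXml and Normalize are made up for illustration, and it assumes the input is well-formed XML. Note that this re-serializes the document, so the output is always UTF-8 and insignificant formatting may differ from the input.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class BomFreeXml // hypothetical helper name
{
    // Parses XML bytes (with or without a BOM) and re-serializes them
    // as UTF-8 without a BOM.
    public static byte[] Normalize(byte[] bytes)
    {
        XDocument document;
        using (var input = new MemoryStream(bytes))
        {
            // The parser detects the encoding and consumes any BOM.
            document = XDocument.Load(input);
        }

        // UTF8Encoding(false) means "do not emit a byte order mark".
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };
        using (var output = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(output, settings))
            {
                document.WriteTo(writer);
            } // disposing the writer flushes it; the stream stays usable
            return output.ToArray();
        }
    }
}
```

Calling BomFreeXml.Normalize(byteArray) would then return bytes that start directly with the XML declaration.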
Rich O'Kelly answered Oct 13 '22 03:10


You could do something like this to skip the BOM bytes while reading from a stream. You would need to extend Bom.cs to cover further encodings; as far as I know the UTF encodings are the only ones that use a BOM, though I could well be wrong about that.

I got the info on the encoding types from here

using (var stream = File.OpenRead("path_to_file"))
{
    stream.Position = Bom.GetCursor(stream);
}


using System.IO;
using System.Linq;

public static class Bom
{
    // Returns the offset of the first byte after the BOM, or 0 if no BOM is found.
    public static int GetCursor(Stream stream)
    {
        // UTF-32, big-endian
        if (IsMatch(stream, new byte[] { 0x00, 0x00, 0xFE, 0xFF }))
            return 4;
        // UTF-32, little-endian (must be tested before UTF-16 LE, whose BOM it starts with)
        if (IsMatch(stream, new byte[] { 0xFF, 0xFE, 0x00, 0x00 }))
            return 4;
        // UTF-16, big-endian
        if (IsMatch(stream, new byte[] { 0xFE, 0xFF }))
            return 2;
        // UTF-16, little-endian
        if (IsMatch(stream, new byte[] { 0xFF, 0xFE }))
            return 2;
        // UTF-8
        if (IsMatch(stream, new byte[] { 0xEF, 0xBB, 0xBF }))
            return 3;
        return 0;
    }

    // Requires a seekable stream: rewinds to the start and compares the leading bytes.
    private static bool IsMatch(Stream stream, byte[] match)
    {
        stream.Position = 0;
        var buffer = new byte[match.Length];
        var read = stream.Read(buffer, 0, buffer.Length);
        // A stream shorter than the BOM cannot contain it.
        return read == match.Length && buffer.SequenceEqual(match);
    }
}
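
Since the question starts from a byte[] rather than a stream, the same prefix checks can also be applied to the array directly. The following is only a sketch: BomBytes, GetBomLength and Strip are names invented here, and it recognizes just the five UTF marks listed above. It also assumes the payload is XML, so UTF-16 LE content cannot begin with a NUL character that would make it look like a UTF-32 LE BOM.

```csharp
using System;

public static class BomBytes // hypothetical helper name
{
    // The UTF-32 LE mark is listed before the UTF-16 LE mark,
    // because it begins with the same two bytes.
    private static readonly byte[][] Marks =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32, big-endian
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32, little-endian
        new byte[] { 0xFE, 0xFF },             // UTF-16, big-endian
        new byte[] { 0xFF, 0xFE },             // UTF-16, little-endian
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
    };

    // Returns the number of leading bytes that form a known BOM, or 0.
    public static int GetBomLength(byte[] data)
    {
        foreach (var mark in Marks)
        {
            if (StartsWith(data, mark))
                return mark.Length;
        }
        return 0;
    }

    // Returns a copy of the data with any leading BOM removed.
    public static byte[] Strip(byte[] data)
    {
        var offset = GetBomLength(data);
        var result = new byte[data.Length - offset];
        Array.Copy(data, offset, result, 0, result.Length);
        return result;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length)
            return false;
        for (var i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i])
                return false;
        }
        return true;
    }
}
```

Unlike the XDocument approach, this leaves the remaining bytes untouched, so the XML declaration and encoding stay exactly as they were.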
Ross Jones answered Oct 13 '22 01:10


You don't have to worry about BOM.

If for some reason you need to use an XmlDocument object, maybe this code can help you:

byte[] file_content = {wherever you get it};
XmlDocument xml = new XmlDocument();
xml.Load(new MemoryStream(file_content));

It worked for me when I tried to download an XML attachment from a Gmail account using the Google API. The file had a BOM, and using Encoding.UTF8.GetString(file_content) didn't work "properly".
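
A likely reason Encoding.UTF8.GetString didn't work "properly" is that it decodes the BOM bytes into a U+FEFF character and leaves it at the front of the string, which XML parsers reading from that string typically reject. If a string is really needed, a StreamReader skips the BOM while decoding. A minimal sketch, where BomText and Decode are hypothetical names:

```csharp
using System.IO;
using System.Text;

public static class BomText // hypothetical helper name
{
    // Decodes bytes to text, skipping a leading BOM if one is present.
    // UTF-8 is the fallback when no BOM is found; the third argument
    // lets the reader switch encoding based on a detected BOM.
    public static string Decode(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The resulting string starts with the first real character of the document, so it can be passed on to something like XmlDocument.LoadXml.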

prueba prueba answered Oct 13 '22 03:10