
c# Detect xml encoding from Byte Array?

I have a byte array that I know contains an XML-serialized object. Is there any way to get the encoding from it?

I'm not going to deserialize it, but I'm saving it in an xml field on a SQL Server, so I need to convert it to a string.

Peter asked Feb 24 '09 10:02


2 Answers

A solution similar to this question could solve this by using a Stream over the byte array. Then you won't have to fiddle at the byte level. Like this:

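// Requires: using System.IO; using System.Text; using System.Xml;
Encoding encoding;
using (var stream = new MemoryStream(bytes)) // bytes: the serialized XML
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        // The reader has to start reading before Encoding is populated;
        // MoveToContent() consumes any BOM and the XML declaration.
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}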
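
A self-contained sketch of the same idea, wrapped in a method and followed by the conversion to a string that the question asks about. The class name, the method name `DetectEncoding`, and the sample document are hypothetical and only there for illustration:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class XmlEncodingDemo
{
    // Hypothetical helper: returns the encoding XmlTextReader reports for the bytes.
    public static Encoding DetectEncoding(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new XmlTextReader(stream))
        {
            reader.MoveToContent(); // forces the reader past the XML declaration
            return reader.Encoding;
        }
    }

    static void Main()
    {
        // Sample input (an assumption): a UTF-8 document without a BOM.
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><root/>");

        Encoding encoding = DetectEncoding(bytes);
        string xml = encoding.GetString(bytes); // decode with the detected encoding

        Console.WriteLine(encoding.WebName);
        Console.WriteLine(xml);
    }
}
```

Note that `GetString` does not strip a byte order mark; if the bytes start with one, the resulting string starts with U+FEFF.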
Peter Lillevold answered Oct 17 '22 02:10


The W3C XML specification has a section, Appendix F ("Autodetection of Character Encodings"), on how to determine the encoding of a byte string.

First check for a Unicode Byte Order Mark

A BOM is just another character; it's the:

'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)

For example:

  • ZWNBSP<?xml vers
  • "\ufeff<?xml vers"
  • "\ufeff\u003c\u003f\u0078\u006d\u006c\u0020\u0076\u0065\u0072\u0073"
  • U+FEFFU+003CU+003FU+0078U+006DU+006CU+0020U+0076U+0065U+0072U+0073

The character U+FEFF, along with every other character in the file, is encoded using the appropriate encoding scheme:

  • 00 00 FE FF: UCS-4, big-endian machine (1234 order)
  • FF FE 00 00: UCS-4, little-endian machine (4321 order)
  • 00 00 FF FE: UCS-4, unusual octet order (2143)
  • FE FF 00 00: UCS-4, unusual octet order (3412)
  • FE FF ## ##: UTF-16, big-endian
  • FF FE ## ##: UTF-16, little-endian
  • EF BB BF: UTF-8

where ## ## can be anything - except for both being zero

For example, the same string encoded as UTF-16 little-endian:

  • U+FEFF U+003C U+003F U+0078 U+006D U+006C U+0020 U+0076 U+0065 U+0072 U+0073
  • fffe 3c00 3f00 7800 6d00 6c00 2000 7600 6500 7200 7300
  • ff fe 3c 00 3f 00 78 00 6d 00 6c 00 20 00 76 00 65 00 72 00 73 00
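
To see these signatures come out of real encoders, here is a minimal C# sketch that encodes the sample string with three of the framework's built-in encodings and prints the bytes as hex. The class `BomDemo` and the helper `Hex` are hypothetical names. The leading bytes come from the U+FEFF character in the string itself, since `GetBytes` does not add a preamble on its own:

```csharp
using System;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: encodes text and formats the bytes as spaced hex.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text)).Replace('-', ' ');
    }

    static void Main()
    {
        const string text = "\uFEFF<?xml vers";

        // UTF-16 little-endian:
        // FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00
        Console.WriteLine(Hex(Encoding.Unicode, text));

        // UTF-16 big-endian: starts with FE FF 00 3C
        Console.WriteLine(Hex(Encoding.BigEndianUnicode, text));

        // UTF-8: EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73
        Console.WriteLine(Hex(Encoding.UTF8, text));
    }
}
```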

So first check the initial bytes for any of those signatures. If you find one of them, return the matching code-page identifier:

UInt32 GuessEncoding(byte[] XmlString)
{
   if BytesEqual(XmlString, [00, 00, $fe, $ff]) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if BytesEqual(XmlString, [$ff, $fe, 00, 00]) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if BytesEqual(XmlString, [00, 00, $ff, $fe]) throw new Exception("Nobody supports 2143 UCS-4");
   if BytesEqual(XmlString, [$fe, $ff, 00, 00]) throw new Exception("Nobody supports 3412 UCS-4");
   if BytesEqual(XmlString, [$fe, $ff])
   {
      //## ## may be anything except both zero, hence OR rather than AND
      if (XmlString[2] <> 0) || (XmlString[3] <> 0)
         return 1201;  //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   }
   if BytesEqual(XmlString, [$ff, $fe])
   {
      if (XmlString[2] <> 0) || (XmlString[3] <> 0)
         return 1200;  //"utf-16" - Unicode UTF-16, little endian byte order
   }
   if BytesEqual(XmlString, [$ef, $bb, $bf])    return 65001; //"utf-8" - Unicode (UTF-8)

Or else look for <?xml

If the XML document has no Byte Order Mark character, then you move on to looking for the first five characters of the XML declaration, which a document must begin with unless it is encoded in UTF-8 or UTF-16:

<?xml

It's helpful to know that

  • < is #x0000003C
  • ? is #x0000003F

With that we have enough to look at the first four bytes:

  • 00 00 00 3C: UCS-4, big-endian machine (1234 order)
  • 3C 00 00 00: UCS-4, little-endian machine (4321 order)
  • 00 00 3C 00: UCS-4, unusual octet order (2143)
  • 00 3C 00 00: UCS-4, unusual octet order (3412)
  • 00 3C 00 3F: UTF-16, big-endian
  • 3C 00 3F 00: UTF-16, little-endian
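  • 3C 3F 78 6D: UTF-8 or another ASCII-compatible encoding (e.g. ISO 8859, Shift-JIS); the encoding declaration says which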
  • 4C 6F A7 94: some flavor of EBCDIC

So we can then add more to our code:

   if BytesEqual(XmlString, [00, 00, 00, $3C])    return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if BytesEqual(XmlString, [$3C, 00, 00, 00])    return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if BytesEqual(XmlString, [00, 00, $3C, 00])    throw new Exception("Nobody supports 2143 UCS-4");
   if BytesEqual(XmlString, [00, $3C, 00, 00])    throw new Exception("Nobody supports 3412 UCS-4");
   if BytesEqual(XmlString, [00, $3C, 00, $3F])   return 1201;  //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   if BytesEqual(XmlString, [$3C, 00, $3F, 00])   return 1200;  //"utf-16" - Unicode UTF-16, little endian byte order
   if BytesEqual(XmlString, [$3C, $3F, $78, $6D]) return 65001; //"utf-8" - Unicode (UTF-8)
   if BytesEqual(XmlString, [$4C, $6F, $A7, $94])
   {
      //Some variant of EBCDIC, e.g.:
      //20273   IBM273  IBM EBCDIC Germany
      //20277   IBM277  IBM EBCDIC Denmark-Norway
      //20278   IBM278  IBM EBCDIC Finland-Sweden
      //20280   IBM280  IBM EBCDIC Italy
      //20284   IBM284  IBM EBCDIC Latin America-Spain
      //20285   IBM285  IBM EBCDIC United Kingdom
      //20290   IBM290  IBM EBCDIC Japanese Katakana Extended
      //20297   IBM297  IBM EBCDIC France
      //20420   IBM420  IBM EBCDIC Arabic
      //20423   IBM423  IBM EBCDIC Greek
      //20424   IBM424  IBM EBCDIC Hebrew
      //20833   x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended
      //20838   IBM-Thai    IBM EBCDIC Thai
      //20871   IBM871  IBM EBCDIC Icelandic
      //20880   IBM880  IBM EBCDIC Cyrillic Russian
      //20905   IBM905  IBM EBCDIC Turkish
      //20924   IBM00924    IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
      throw new Exception("We don't support EBCDIC. Sorry");
   }

   //Otherwise assume UTF-8, and fail to decode it anyway
   return 65001; //"utf-8" - Unicode (UTF-8)

   //Any code is in the public domain. No attribution required.
}
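
The pseudocode above can be written as compilable C# roughly as follows. This is a sketch, not a drop-in implementation: the names `XmlEncodingSniffer`, `StartsWith` and `GuessCodePage` are hypothetical, the two unusual UCS-4 orders are folded into one check, and the length check in `StartsWith` is an addition so that arrays shorter than four bytes do not throw:

```csharp
using System;
using System.Text;

static class XmlEncodingSniffer
{
    // True when data begins with the given signature bytes.
    static bool StartsWith(byte[] data, params byte[] signature)
    {
        if (data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
            if (data[i] != signature[i]) return false;
        return true;
    }

    // Returns a Windows code-page identifier for the XML bytes.
    public static int GuessCodePage(byte[] xml)
    {
        if (xml == null) throw new ArgumentNullException(nameof(xml));

        // Byte order marks. The 4-byte UTF-32 signatures are tested first
        // because FF FE 00 00 also starts with the UTF-16 signature FF FE.
        if (StartsWith(xml, 0x00, 0x00, 0xFE, 0xFF)) return 12001; // utf-32BE
        if (StartsWith(xml, 0xFF, 0xFE, 0x00, 0x00)) return 12000; // utf-32 (LE)
        if (StartsWith(xml, 0x00, 0x00, 0xFF, 0xFE) ||
            StartsWith(xml, 0xFE, 0xFF, 0x00, 0x00))
            throw new NotSupportedException("Unusual UCS-4 octet order (2143 or 3412).");
        if (StartsWith(xml, 0xFE, 0xFF)) return 1201;              // utf-16 BE
        if (StartsWith(xml, 0xFF, 0xFE)) return 1200;              // utf-16 LE
        if (StartsWith(xml, 0xEF, 0xBB, 0xBF)) return 65001;       // utf-8

        // No BOM: look at how "<?xm" is laid out in the first four bytes.
        if (StartsWith(xml, 0x00, 0x00, 0x00, 0x3C)) return 12001;
        if (StartsWith(xml, 0x3C, 0x00, 0x00, 0x00)) return 12000;
        if (StartsWith(xml, 0x00, 0x00, 0x3C, 0x00) ||
            StartsWith(xml, 0x00, 0x3C, 0x00, 0x00))
            throw new NotSupportedException("Unusual UCS-4 octet order (2143 or 3412).");
        if (StartsWith(xml, 0x00, 0x3C, 0x00, 0x3F)) return 1201;
        if (StartsWith(xml, 0x3C, 0x00, 0x3F, 0x00)) return 1200;
        if (StartsWith(xml, 0x3C, 0x3F, 0x78, 0x6D)) return 65001;
        if (StartsWith(xml, 0x4C, 0x6F, 0xA7, 0x94))
            throw new NotSupportedException("EBCDIC is not supported.");

        return 65001; // otherwise assume UTF-8
    }

    static void Main()
    {
        // Sample input (an assumption): UTF-8 bytes without a BOM.
        byte[] sample = Encoding.UTF8.GetBytes("<?xml version=\"1.0\"?><root/>");
        Console.WriteLine(GuessCodePage(sample)); // prints 65001
    }
}
```

The returned identifier can be passed to `Encoding.GetEncoding(codePage)` to obtain an `Encoding` for decoding the array.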
Ian Boyd answered Oct 17 '22 00:10