Well, I have a byte array, and I know it holds an XML-serialized object. Is there any way to get the encoding from the byte array?
I'm not going to deserialize it, but I'm saving it in an XML field on a SQL Server, so I need to convert it to a string.
A solution similar to this question could solve this by using a Stream over the byte array. Then you won't have to fiddle at the byte level. Like this:
Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}
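To get from the detected encoding to the string the question asks for, one option is to decode the same bytes again with that encoding. This is only a minimal sketch: the class and method names (XmlByteDecoder, XmlBytesToString) are hypothetical and not part of the original answer, and it assumes the byte array holds a complete, well-formed XML document.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlByteDecoder
{
    // Hypothetical helper (not from the original answer):
    // detect the encoding with XmlTextReader, then decode the bytes with it.
    public static string XmlBytesToString(byte[] bytes)
    {
        Encoding encoding;
        using (var stream = new MemoryStream(bytes))
        using (var xmlreader = new XmlTextReader(stream))
        {
            // Reading up to the root element makes the reader examine
            // the byte order mark and the XML declaration.
            xmlreader.MoveToContent();
            encoding = xmlreader.Encoding;
        }

        // Decode the same bytes with the detected encoding.
        // With BOM detection enabled, StreamReader consumes a leading
        // byte order mark instead of returning it as a character.
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, encoding, true))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The returned string still contains the original XML declaration, so it is worth checking that the declared encoding is acceptable to whatever consumes the string.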
The W3C XML specification has a section on how to determine the encoding of a byte string.
A BOM is just another character; it's the:
'ZERO WIDTH NO-BREAK SPACE' (U+FEFF)
For example:
"\ufeff<?xml vers"
"\ufeff\u003c\u003f\u0078\u006d\u006c\u0020\u0076\u0065\u0072\u0073"
The character U+FEFF, along with every other character in the file, is encoded using the appropriate encoding scheme:
00 00 FE FF : UCS-4, big-endian machine (1234 order)
FF FE 00 00 : UCS-4, little-endian machine (4321 order)
00 00 FF FE : UCS-4, unusual octet order (2143)
FE FF 00 00 : UCS-4, unusual octet order (3412)
FE FF ## ## : UTF-16, big-endian
FF FE ## ## : UTF-16, little-endian
EF BB BF    : UTF-8
where ## ## can be anything, except for both being zero.
For example, the BOM followed by <?xml vers, encoded as UTF-16 little-endian, is:
ff fe 3c 00 3f 00 78 00 6d 00 6c 00 20 00 76 00 65 00 72 00 73 00
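The point that the BOM is encoded like any other character can be checked directly. The following is a small sketch, assuming a hypothetical helper class BomDemo (not from the original answer), that encodes the same string with several .NET encodings and prints the resulting bytes as hex.

```csharp
using System;
using System.Text;

public static class BomDemo
{
    // Returns the bytes of text in the given encoding as space-separated lowercase hex.
    public static string Hex(string text, Encoding encoding)
    {
        // GetBytes encodes only the characters in the string; it adds no preamble
        // of its own, so the BOM bytes seen here come from the U+FEFF character itself.
        byte[] bytes = encoding.GetBytes(text);
        return BitConverter.ToString(bytes).Replace("-", " ").ToLowerInvariant();
    }

    public static void Main()
    {
        const string text = "\ufeff<?xml vers";
        Console.WriteLine(Hex(text, Encoding.UTF8));             // begins with ef bb bf
        Console.WriteLine(Hex(text, Encoding.Unicode));          // begins with ff fe
        Console.WriteLine(Hex(text, Encoding.BigEndianUnicode)); // begins with fe ff
        Console.WriteLine(Hex(text, Encoding.UTF32));            // begins with ff fe 00 00
    }
}
```

Encoding.Unicode is UTF-16 little-endian and Encoding.UTF32 is UTF-32 little-endian, which is why both lines start with ff fe.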
So first check the initial bytes for any of those signatures. If you find one of them, return that code-page identifier:
UInt32 GuessEncoding(byte[] XmlString)
{
   //Pseudocode: BytesEqual compares the leading bytes of XmlString with the given signature
   if BytesEqual(XmlString, [00, 00, $fe, $ff]) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if BytesEqual(XmlString, [$ff, $fe, 00, 00]) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if BytesEqual(XmlString, [00, 00, $ff, $fe]) throw new Exception("Nobody supports 2143 UCS-4");
   if BytesEqual(XmlString, [$fe, $ff, 00, 00]) throw new Exception("Nobody supports 3412 UCS-4");
   if BytesEqual(XmlString, [$fe, $ff])
   {
      //UTF-16 as long as the next two bytes are not both zero
      if (XmlString[2] <> 0) || (XmlString[3] <> 0)
         return 1201; //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   }
   if BytesEqual(XmlString, [$ff, $fe])
   {
      if (XmlString[2] <> 0) || (XmlString[3] <> 0)
         return 1200; //"utf-16" - Unicode UTF-16, little endian byte order
   }
   if BytesEqual(XmlString, [$ef, $bb, $bf]) return 65001; //"utf-8" - Unicode (UTF-8)
If the XML document has no Byte Order Mark character, then you move on to looking for the first five characters of the XML declaration, which a document in any encoding other than UTF-8 or UTF-16 must begin with:
<?xml
It's helpful to know that
< is #x0000003C
? is #x0000003F
With that we have enough to look at the first four bytes:
00 00 00 3C : UCS-4, big-endian machine (1234 order)
3C 00 00 00 : UCS-4, little-endian machine (4321 order)
00 00 3C 00 : UCS-4, unusual octet order (2143)
00 3C 00 00 : UCS-4, unusual octet order (3412)
00 3C 00 3F : UTF-16, big-endian
3C 00 3F 00 : UTF-16, little-endian
3C 3F 78 6D : UTF-8 (or another ASCII-compatible encoding; the encoding declaration says which)
4C 6F A7 94 : some flavor of EBCDIC
So we can then add more to our code:
   if BytesEqual(XmlString, [00, 00, 00, $3C]) return 12001; //"utf-32BE" - Unicode UTF-32, big endian byte order
   if BytesEqual(XmlString, [$3C, 00, 00, 00]) return 12000; //"utf-32" - Unicode UTF-32, little endian byte order
   if BytesEqual(XmlString, [00, 00, $3C, 00]) throw new Exception("Nobody supports 2143 UCS-4");
   if BytesEqual(XmlString, [00, $3C, 00, 00]) throw new Exception("Nobody supports 3412 UCS-4");
   if BytesEqual(XmlString, [00, $3C, 00, $3F]) return 1201; //"unicodeFFFE" - Unicode UTF-16, big endian byte order
   if BytesEqual(XmlString, [$3C, 00, $3F, 00]) return 1200; //"utf-16" - Unicode UTF-16, little endian byte order
   if BytesEqual(XmlString, [$3C, $3F, $78, $6D]) return 65001; //"utf-8" - Unicode (UTF-8)
   if BytesEqual(XmlString, [$4C, $6F, $A7, $94])
   {
      //Some variant of EBCDIC, e.g.:
      //20273 IBM273 IBM EBCDIC Germany
      //20277 IBM277 IBM EBCDIC Denmark-Norway
      //20278 IBM278 IBM EBCDIC Finland-Sweden
      //20280 IBM280 IBM EBCDIC Italy
      //20284 IBM284 IBM EBCDIC Latin America-Spain
      //20285 IBM285 IBM EBCDIC United Kingdom
      //20290 IBM290 IBM EBCDIC Japanese Katakana Extended
      //20297 IBM297 IBM EBCDIC France
      //20420 IBM420 IBM EBCDIC Arabic
      //20423 IBM423 IBM EBCDIC Greek
      //20424 IBM424 IBM EBCDIC Hebrew
      //20833 x-EBCDIC-KoreanExtended IBM EBCDIC Korean Extended
      //20838 IBM-Thai IBM EBCDIC Thai
      //20871 IBM871 IBM EBCDIC Icelandic
      //20880 IBM880 IBM EBCDIC Cyrillic Russian
      //20905 IBM905 IBM EBCDIC Turkish
      //20924 IBM00924 IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
      throw new Exception("We don't support EBCDIC. Sorry");
   }
   //Otherwise assume UTF-8, and fail to decode it anyway
   return 65001; //"utf-8" - Unicode (UTF-8)
   //Any code is in the public domain. No attribution required.
}
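The pseudocode above can be turned into compilable C# along these lines. This is a sketch under stated assumptions rather than a definitive implementation: the class name XmlEncodingSniffer and the StartsWith helper (standing in for BytesEqual) are hypothetical, and the "not both zero" test for UTF-16 is folded into the ordering, since the four-byte UCS-4 signatures are checked first.

```csharp
using System;

public static class XmlEncodingSniffer
{
    // Hypothetical stand-in for BytesEqual: true when data begins with the prefix bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Guesses a Windows code-page identifier from the first bytes of an XML document.
    public static int GuessEncoding(byte[] xml)
    {
        // 1. Byte order marks. Four-byte signatures go first,
        //    because FF FE 00 00 (UTF-32 LE) also starts with FF FE (UTF-16 LE).
        if (StartsWith(xml, 0x00, 0x00, 0xFE, 0xFF)) return 12001; // utf-32BE
        if (StartsWith(xml, 0xFF, 0xFE, 0x00, 0x00)) return 12000; // utf-32
        if (StartsWith(xml, 0x00, 0x00, 0xFF, 0xFE) || StartsWith(xml, 0xFE, 0xFF, 0x00, 0x00))
            throw new NotSupportedException("Unusual UCS-4 octet order (2143 or 3412)");
        if (StartsWith(xml, 0xFE, 0xFF)) return 1201;              // unicodeFFFE (UTF-16 BE)
        if (StartsWith(xml, 0xFF, 0xFE)) return 1200;              // utf-16 (UTF-16 LE)
        if (StartsWith(xml, 0xEF, 0xBB, 0xBF)) return 65001;       // utf-8

        // 2. No BOM: look at how the leading "<?" of the XML declaration is encoded.
        if (StartsWith(xml, 0x00, 0x00, 0x00, 0x3C)) return 12001; // utf-32BE
        if (StartsWith(xml, 0x3C, 0x00, 0x00, 0x00)) return 12000; // utf-32
        if (StartsWith(xml, 0x00, 0x00, 0x3C, 0x00) || StartsWith(xml, 0x00, 0x3C, 0x00, 0x00))
            throw new NotSupportedException("Unusual UCS-4 octet order (2143 or 3412)");
        if (StartsWith(xml, 0x00, 0x3C, 0x00, 0x3F)) return 1201;  // UTF-16 BE
        if (StartsWith(xml, 0x3C, 0x00, 0x3F, 0x00)) return 1200;  // UTF-16 LE
        if (StartsWith(xml, 0x4C, 0x6F, 0xA7, 0x94))
            throw new NotSupportedException("EBCDIC is not supported");

        // 3. "<?xm" in an ASCII-compatible encoding, or anything else: assume UTF-8.
        return 65001;
    }
}
```

The returned identifier can be passed to Encoding.GetEncoding to obtain an Encoding for decoding. Note that the final fallback does not read the encoding declaration, so a document declared as, say, ISO-8859-1 would still be reported as UTF-8.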