Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to best detect encoding in XML file?

To load XML files with arbitrary encoding I have the following code:

Encoding encoding;
using (var reader = new XmlTextReader(filepath))
{
    reader.MoveToContent();
    encoding = reader.Encoding;
}

var settings = new XmlReaderSettings { NameTable = new NameTable() };
var xmlns = new XmlNamespaceManager(settings.NameTable);
var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default, 
    encoding);
using (var reader = XmlReader.Create(filepath, settings, context))
{
    return XElement.Load(reader);
}

This works, but it seems a bit inefficient to open the file twice. Is there a better way to detect the encoding such that I can do:

  1. Open file
  2. Detect encoding
  3. Read XML into an XElement
  4. Close file
like image 579
Peter Lillevold Avatar asked Mar 12 '09 09:03

Peter Lillevold


People also ask

What is the encoding of XML file?

XML documents generated in or parsed from national data items must be encoded in Unicode UTF-16 in big-endian format, CCSID 1200.

What is UTF encoding in XML?

Unicode Transformation Format, 8-bit encoding form is designed for ease of use with existing ASCII-based systems and enables use of all the characters in the Unicode standard.

Does XML use UTF-8?

UTF-8 is the default character encoding for XML documents. Character encoding can be studied in our Character Set Tutorial. UTF-8 is also the default encoding for HTML5, CSS, JavaScript, PHP, and SQL.

What is UTF-16 in XML?

UTF stands for UCS Transformation Format, and UCS itself means Universal Character Set. The number 8 or 16 refers to the number of bits used to represent a character. They are either 8(1 to 4 bytes) or 16(2 or 4 bytes). For the documents without encoding information, UTF-8 is set by default.


1 Answers

Ok, I should have thought of this earlier. Both XmlTextReader (which gives us the Encoding) and XmlReader.Create (which allows us to specify encoding) accepts a Stream. So how about first opening a FileStream and then use this with both XmlTextReader and XmlReader, like this:

using (var txtreader = new FileStream(filepath, FileMode.Open))
{
    using (var xmlreader = new XmlTextReader(txtreader))
    {
        // Read in the encoding info
        xmlreader.MoveToContent();
        var encoding = xmlreader.Encoding;

        // Rewind to the beginning
        txtreader.Seek(0, SeekOrigin.Begin);

        var settings = new XmlReaderSettings { NameTable = new NameTable() };
        var xmlns = new XmlNamespaceManager(settings.NameTable);
        var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default,
                 encoding);

        using (var reader = XmlReader.Create(txtreader, settings, context))
        {
            return XElement.Load(reader);
        }
    }
}

This works like a charm. Reading XML files in an encoding independent way should have been more elegant but at least I'm getting away with only one file open.

like image 108
Peter Lillevold Avatar answered Sep 28 '22 04:09

Peter Lillevold