
XML serializing with XmlWriter via StringBuilder is utf-16 while via Stream is utf-8?

I was surprised when I ran into this, so I wrote a console application to reproduce it and make sure nothing else in my code was causing it.

Can anyone explain this?

Here's the code:

using System;    
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

namespace ConsoleApplication1
{
    public class Program
    {
        static void Main(string[] args)
        {
            var o = new SomeObject { Field1 = "string value", Field2 = 8 };

            Console.WriteLine("ObjectToXmlViaStringBuilder");
            Console.Write(ObjectToXmlViaStringBuilder(o));
            Console.WriteLine();
            Console.WriteLine();
            Console.WriteLine("ObjectToXmlViaStream");
            Console.Write(StreamToString(ObjectToXmlViaStream(o)));
            Console.ReadKey();
        }

        public static string ObjectToXmlViaStringBuilder(SomeObject someObject)
        {
            var output = new StringBuilder();
            var settings = new XmlWriterSettings { Encoding = Encoding.UTF8, Indent = true };

            using (var xmlWriter = XmlWriter.Create(output, settings))
            {
                var serializer = new XmlSerializer(typeof(SomeObject));
                var namespaces = new XmlSerializerNamespaces();

                xmlWriter.WriteStartDocument();
                xmlWriter.WriteDocType("Field1", null, "someObject.dtd", null);
                namespaces.Add(string.Empty, string.Empty);
                serializer.Serialize(xmlWriter, someObject, namespaces);
            }

            return output.ToString();
        }

        private static string StreamToString(Stream stream)
        {
            var reader = new StreamReader(stream);
            return reader.ReadToEnd();
        }

        public static Stream ObjectToXmlViaStream(SomeObject someObject)
        {
            var output = new MemoryStream();
            var settings = new XmlWriterSettings { Encoding = Encoding.UTF8, Indent = true };

            using (var xmlWriter = XmlWriter.Create(output, settings))
            {
                var serializer = new XmlSerializer(typeof(SomeObject));
                var namespaces = new XmlSerializerNamespaces();

                xmlWriter.WriteStartDocument();
                xmlWriter.WriteDocType("Field1", null, "someObject.dtd", null);
                namespaces.Add(string.Empty, string.Empty);
                serializer.Serialize(xmlWriter, someObject, namespaces);
            }

            output.Seek(0L, SeekOrigin.Begin);

            return output;
        }

        public class SomeObject
        {
            public string Field1 { get; set; }
            public int Field2 { get; set; }
        }
    }
}

This is the result:

ObjectToXmlViaStringBuilder

<?xml version="1.0" encoding="utf-16"?>
<!DOCTYPE Field1 SYSTEM "someObject.dtd">
<SomeObject>
  <Field1>string value</Field1>
  <Field2>8</Field2>
</SomeObject>

ObjectToXmlViaStream

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE Field1 SYSTEM "someObject.dtd">
<SomeObject>
  <Field1>string value</Field1>
  <Field2>8</Field2>
</SomeObject>

H.Wolper asked May 12 '11 07:05


3 Answers

When you create an XmlWriter around a TextWriter, the XmlWriter always uses the encoding of the underlying TextWriter. XmlWriter.Create(StringBuilder, ...) wraps the StringBuilder in a StringWriter, and the encoding of a StringWriter is always UTF-16, since that's how .NET strings are encoded internally.

When you create an XmlWriter around a Stream, there is no encoding defined for the Stream, so it uses the encoding specified in the XmlWriterSettings.
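
As a minimal sketch of this rule, the snippet below writes the same small document through a plain StringWriter and through a StringWriter subclass that overrides the Encoding property. The names Utf8StringWriter, Demo and Declaration are hypothetical and only serve the illustration. Note that the string in memory is still UTF-16 either way; only the encoding named in the declaration changes, which is useful if the string is later saved as UTF-8 bytes.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper (not part of the framework): a StringWriter that
// reports UTF-8 instead of the default UTF-16.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}

public static class Demo
{
    // Writes a tiny document to the given TextWriter and returns the text.
    public static string Declaration(TextWriter target)
    {
        var settings = new XmlWriterSettings { Indent = true };
        using (var xmlWriter = XmlWriter.Create(target, settings))
        {
            xmlWriter.WriteStartDocument();
            xmlWriter.WriteElementString("root", "value");
        }
        return target.ToString();
    }

    public static void Main()
    {
        // Declaration names utf-16: taken from StringWriter.Encoding.
        Console.WriteLine(Declaration(new StringWriter()));
        // Declaration names utf-8: taken from the overridden Encoding.
        Console.WriteLine(Declaration(new Utf8StringWriter()));
    }
}
```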

Thomas Levesque answered Nov 08 '22 00:11


The most elegant solution for me is to write to a MemoryStream and then use an Encoding to convert the bytes to a string, like so:

        using (MemoryStream memS = new MemoryStream())
        {
            // Set up the XML settings. The writer must use the same encoding
            // that is later used to decode the bytes.
            XmlWriterSettings settings = new XmlWriterSettings();
            settings.OmitXmlDeclaration = OmitXmlHeader;
            settings.Encoding = encoding;

            using (XmlWriter writer = XmlWriter.Create(memS, settings))
            {
                // Write the XML to the stream; disposing the writer flushes it.
                xmlSerializer.Serialize(writer, objectToSerialize);
            }

            // Decode the bytes in the memory stream into a string.
            retString.Append(encoding.GetString(memS.ToArray()));
        }

The conversion to a string happens at encoding.GetString(memS.ToArray()).
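
One detail worth watching with this approach: Encoding.UTF8 writes a byte order mark to the stream, and GetString does not strip it, so the resulting string starts with U+FEFF. The self-contained sketch below assumes a hypothetical helper named XmlStringHelper.Serialize and reuses the SomeObject shape from the question; it shows the difference between Encoding.UTF8 and a UTF8Encoding created without a BOM.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public class SomeObject
{
    public string Field1 { get; set; }
    public int Field2 { get; set; }
}

public static class XmlStringHelper
{
    // Hypothetical helper: serializes to a MemoryStream with the given
    // encoding, then decodes the bytes with that same encoding.
    public static string Serialize<T>(T value, Encoding encoding)
    {
        var settings = new XmlWriterSettings { Encoding = encoding, Indent = true };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                new XmlSerializer(typeof(T)).Serialize(writer, value);
            }
            return encoding.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        var o = new SomeObject { Field1 = "string value", Field2 = 8 };

        string withBom = Serialize(o, Encoding.UTF8);
        string withoutBom = Serialize(o, new UTF8Encoding(false));

        Console.WriteLine(withBom[0] == '\uFEFF'); // True: the BOM survives decoding
        Console.WriteLine(withoutBom[0] == '<');   // True: text starts at the declaration
        Console.WriteLine(withoutBom);
    }
}
```

Both strings declare encoding="utf-8"; passing new UTF8Encoding(false) simply avoids the invisible leading character, which can trip up parsers that receive the string later.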

Coby answered Nov 08 '22 01:11


Where possible, the XmlWriter uses the encoding of the underlying stream. If it wrote UTF-8 data to a stream it knew was UTF-16, you'd end up with a mess. Writing UTF-16 data to a UTF-8 stream also causes problems, especially in environments that use null-terminated strings (like C/C++), because UTF-16 encodes every ASCII character with a zero byte.

The StringBuilder/StringWriter presents a UTF-16 stream to the XmlWriter, so the XmlWriter ignores your requested setting and uses that.

In practice I usually don't emit the header (the XML declaration). That way I can use a StringBuilder underneath and save a few lines of code that would otherwise deal with switching encodings.
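
A minimal sketch of that approach, assuming the SomeObject shape from the question and a hypothetical class name NoDeclarationDemo: setting OmitXmlDeclaration on the XmlWriterSettings leaves out the declaration, so the UTF-16 label never appears in the string.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public class SomeObject
{
    public string Field1 { get; set; }
    public int Field2 { get; set; }
}

public static class NoDeclarationDemo
{
    // Serializes to a StringBuilder without the <?xml ...?> declaration.
    public static string Serialize(SomeObject value)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true, Indent = true };

        using (var xmlWriter = XmlWriter.Create(output, settings))
        {
            // Empty prefix and namespace suppress the default xsi/xsd attributes.
            var namespaces = new XmlSerializerNamespaces();
            namespaces.Add(string.Empty, string.Empty);

            new XmlSerializer(typeof(SomeObject)).Serialize(xmlWriter, value, namespaces);
        }

        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Serialize(new SomeObject { Field1 = "string value", Field2 = 8 }));
    }
}
```

Without a declaration, whoever later turns the string into bytes decides the encoding; XML parsers then assume UTF-8 unless a byte order mark indicates UTF-16.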

kͩeͣmͮpͥ ͩ answered Nov 07 '22 23:11