
The Invulnerable XMLException

Background

I serialize a very large List<string> using this code:

public static string SerializeObjectToXML<T>(T item)
{
    XmlSerializer xs = new XmlSerializer(typeof(T));
    using (StringWriter writer = new StringWriter())
    {
        xs.Serialize(writer, item);
        return writer.ToString();
    }
}

And deserialize it using this code:

public static T DeserializeXMLToObject<T>(string xmlText)
{
    if (string.IsNullOrEmpty(xmlText)) return default(T);
    XmlSerializer xs = new XmlSerializer(typeof(T));
    using (MemoryStream memoryStream = new MemoryStream(new UnicodeEncoding().GetBytes(xmlText.Replace((char)0x1A, ' '))))
    using (XmlTextReader xsText = new XmlTextReader(memoryStream))
    {
        xsText.Normalization = true;
        return (T)xs.Deserialize(xsText);
    }
}

But I get this exception when I deserialize it:

XMLException: There is an error in XML document (217388, 15). '[]', hexadecimal value 0x1A, is an invalid character. Line 217388, position 15.

at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader, String encodingStyle, XmlDeserializationEvents events)

at System.Xml.Serialization.XmlSerializer.Deserialize(XmlReader xmlReader)

Question

Why is the xmlText.Replace((char)0x1A, ' ') call not working? What witchery is this?

Some Constraints

  • My code is in C#, framework 4, built in VS2010 Pro.
  • I can't view the value of xmlText in debug mode because the List<string> is too big; the Watch window just displays the error message "Unable to evaluate the expression. Not enough storage is available to complete this operation."
John Isaiah Carmona asked Mar 21 '12 02:03


2 Answers

I think I've found the problem. By default, XmlSerializer will allow you to generate invalid XML.

Given the code:

var input = "\u001a";

var writer = new StringWriter();
var serializer = new XmlSerializer(typeof(string));
serializer.Serialize(writer, input);

Console.WriteLine(writer.ToString());

The output is:

<?xml version="1.0" encoding="utf-16"?>
<string>&#x1A;</string>

This is invalid XML. According to the XML specification, all character references must be to characters which are valid. Valid characters are:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

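As you can see, U+001A (like every other C0 control character except tab, line feed, and carriage return) is not allowed as a reference, since it is not a valid character. C1 control characters fall inside #x20-#xD7FF, so they are permitted, though discouraged.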
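
As a rough illustration, the ranges quoted above can be written out as a small check. This is only a sketch: the class and method names (XmlCharRanges, IsValidXml10CodePoint) are made up for this example and are not part of .NET.

```csharp
using System;

public static class XmlCharRanges
{
    // Hypothetical helper: mirrors the XML 1.0 character ranges quoted above.
    public static bool IsValidXml10CodePoint(int codePoint)
    {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

With this, XmlCharRanges.IsValidXml10CodePoint(0x1A) returns false, while 0x9 (tab) and 0x41 ('A') return true.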

The error message given by the deserializer is a bit misleading; it would be clearer if it said that there was an invalid character reference.
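
This is also why the original Replace call seems to do nothing. The sketch below uses the serialized output shown earlier as a string literal (the class and method names are made up for illustration) and checks for the raw character versus the six-character reference:

```csharp
using System;

public static class ReplaceCheck
{
    // The serialized output shown above, written as a C# string literal.
    public const string SerializedXml =
        "<?xml version=\"1.0\" encoding=\"utf-16\"?>\r\n<string>&#x1A;</string>";

    // True only if the raw U+001A character itself appears in the text.
    public static bool ContainsRawControlChar(string xml)
    {
        return xml.IndexOf((char)0x1A) >= 0;
    }

    // True if the text contains the character reference "&#x1A;".
    public static bool ContainsCharacterReference(string xml)
    {
        return xml.IndexOf("&#x1A;", StringComparison.Ordinal) >= 0;
    }
}
```

For SerializedXml, ContainsRawControlChar returns false and ContainsCharacterReference returns true, so Replace((char)0x1A, ' ') has nothing to replace.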

There are several options for what you can do.

1) Don't let the XmlSerializer create invalid documents in the first place

You can use an XmlWriter, which by default will not allow invalid characters:

var input = "\u001a";

var writer = new StringWriter();
var serializer = new XmlSerializer(typeof(string));

// added following line:
var xmlWriter = XmlWriter.Create(writer);

// then, write via the xmlWriter rather than writer:
serializer.Serialize(xmlWriter, input);

Console.WriteLine(writer.ToString());

This will throw an exception when the serialization occurs. This will have to be handled and an appropriate error shown.

This probably isn't useful for you because you have data already stored with these invalid characters.
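
For new data, handling that exception might look roughly like the sketch below. SafeXml and TrySerialize are hypothetical names. As far as I know, XmlSerializer wraps the writer's error in an InvalidOperationException, but that is an assumption here, so ArgumentException is caught as well.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public static class SafeXml
{
    // Hypothetical helper: returns true and the XML on success,
    // or false when serialization through the XmlWriter fails.
    public static bool TrySerialize<T>(T item, out string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(T));
        using (StringWriter writer = new StringWriter())
        using (XmlWriter xmlWriter = XmlWriter.Create(writer))
        {
            try
            {
                serializer.Serialize(xmlWriter, item);
                xmlWriter.Flush();
                xml = writer.ToString();
                return true;
            }
            catch (InvalidOperationException)
            {
                // Assumed wrapper type for the invalid-character error.
                // Note: other serialization failures also land here.
                xml = null;
                return false;
            }
            catch (ArgumentException)
            {
                // Fallback in case the writer's error is not wrapped.
                xml = null;
                return false;
            }
        }
    }
}
```

The caller can then show an appropriate error when TrySerialize returns false instead of storing a document that cannot be read back.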

or 2) Strip out references to this invalid character

That is, instead of .Replace((char)0x1a, ' '), which isn't actually replacing anything in your document at the moment, use .Replace("&#x1A;", " "). (This isn't case-insensitive, but it is what .NET generates. A more robust solution would be to use a case-insensitive regex.)
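
A case-insensitive version could be sketched as follows. The names are made up for illustration, and the pattern only targets the hexadecimal form of this one character (optionally with leading zeros); decimal references such as &#26; are not handled.

```csharp
using System.Text.RegularExpressions;

public static class XmlReferenceCleaner
{
    // Matches &#x1A; with hex digits in any letter case,
    // tolerating leading zeros (e.g. &#x001a;).
    private static readonly Regex Reference1A =
        new Regex("&#x0*1A;", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Hypothetical helper: replaces each matching reference with a space.
    public static string ReplaceReference1A(string xml)
    {
        return xml == null ? null : Reference1A.Replace(xml, " ");
    }
}
```

For example, ReplaceReference1A("<string>a&#x1a;b</string>") returns "<string>a b</string>", and the result can then be passed to the deserializer.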


As an aside, XML 1.1 actually allows references to control characters, as long as they are references and not plain characters in the document. This would solve your problem apart from the fact that the .NET XmlSerializer doesn't support version 1.1.

porges answered Sep 20 '22 05:09


If you have existing data in which you serialised a class containing characters that cannot subsequently be deserialised, you can sanitise the data with the following method:

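// Requires: using System.Globalization; using System.Text.RegularExpressions; using System.Xml;
// As an extension method, this must be declared inside a static class.
public static string SanitiseSerialisedXml(this string serialized)
{
    if (serialized == null)
    {
        return null;
    }

    // Matches hexadecimal character references with one or two uppercase hex digits,
    // e.g. &#x1A; (the form .NET generates).
    const string pattern = @"&#x([0-9A-F]{1,2});";

    var sanitised = Regex.Replace(serialized, pattern, match =>
    {
        var value = match.Groups[1].Value;

        int characterCode;
        if (int.TryParse(value, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out characterCode))
        {
            if (characterCode >= char.MinValue && characterCode <= char.MaxValue)
            {
                // Keep references to valid XML characters; drop the rest.
                return XmlConvert.IsXmlChar((char)characterCode) ? match.Value : string.Empty;
            }
        }

        // Anything that could not be parsed is left untouched.
        return match.Value;
    });

    return sanitised;
}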

The preferable solution is to reject invalid characters at the point of serialization, as per point 1 of Porges' answer. This code covers point 2 of Porges' answer (strip out references to this invalid character) and strips out all references to invalid characters. It was written to solve a problem where we had stored serialized data in a database field, so we needed to fix legacy data, and solving the problem at the point of serialization was not an option.

jim.taylor.1974 answered Sep 20 '22 05:09