.NET's XmlTextWriter
creates invalid xml files.
In XML, some control characters are allowed, like 'horizontal tab' (	
), but others are not, like 'vertical tab' (
). (See spec.)
I have a string which contains a UTF-8 control character that is not allowed in XML.
Although XmlTextWriter
escapes the character, the resulting XML is ofcourse still invalid.
How can I make sure that XmlTextWriter
never produces an illegal XML file?
Or, if it's not possible to do this with XmlTextWriter
, how can I strip the specific control characters that aren't allowed in XML from a string?
Example code:
using (XmlTextWriter writer =
new XmlTextWriter("test.xml", Encoding.UTF8))
{
writer.WriteStartDocument();
writer.WriteStartElement("Test");
writer.WriteValue("hello \xb world");
writer.WriteEndElement();
writer.WriteEndDocument();
}
Output:
<?xml version="1.0" encoding="utf-8"?><Test>hello  world</Test>
This documentation of a behaviour is hidden in the documentation of the WriteString method but it sounds like it applies to the whole class.
The default behavior of an XmlWriter created using Create is to throw an ArgumentException when attempting to write character values in the range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD). These invalid XML characters can be written by creating the XmlWriter with the CheckCharacters property set to false. Doing so will result in the characters being replaced with numeric character entities (
�
through�x1F
). Additionally, an XmlTextWriter created with the new operator will replace the invalid characters with numeric character entities by default.
So it seems that you end up writing invalid characters because you are using the XmlTextWriter class. A better solution for you would be to use the XmlWriter Class instead.
Just found this question when I was struggling with the same issue and I ended up solving it with an regex:
return Regex.Replace(s, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");
Hope it helps someone as an alternative solution.
Built in .NET escapers such as SecurityElement.Escape
don't properly escape/strip it either.
CheckCharacters
to false
on both the writer and the reader if your application is the only one interacting with the file. The resulting XML file would still be technically invalid though. See:
XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
xmlWriterSettings.Encoding = new UTF8Encoding(false);
xmlWriterSettings.CheckCharacters = false;
var sb = new StringBuilder();
var w = XmlWriter.Create(sb, xmlWriterSettings);
w.WriteStartDocument();
w.WriteStartElement("Test");
w.WriteString("hello \xb world");
w.WriteEndElement();
w.WriteEndDocument();
w.Close();
var xml = sb.ToString();
CheckCharacters
to true
(which it is by default) is a bit too strict since it will simply throw an exception an alternative approach that's more lenient to invalid XML characters would be to just strip them: Googling a bit yielded the whitelist XmlTextEncoder however it'll also remove DEL
and others in the range U+007F–U+0084, U+0086–U+009F that according to Valid XML Characters on wikipedia are only valid in certain contexts and which the RFC mentions as discouraged but still valid characters.
public static class XmlTextExtentions
{
private static readonly Dictionary<char, string> textEntities = new Dictionary<char, string> {
{ '&', "&"}, { '<', "<" }, { '>', ">" },
{ '"', """ }, { '\'', "'" }
};
public static string ToValidXmlString(this string str)
{
var stripped = str
.Select((c,i) => new
{
c1 = c,
c2 = i + 1 < str.Length ? str[i+1]: default(char),
v = XmlConvert.IsXmlChar(c),
p = i + 1 < str.Length ? XmlConvert.IsXmlSurrogatePair(str[i + 1], c) : false,
pp = i > 0 ? XmlConvert.IsXmlSurrogatePair(c, str[i - 1]) : false
})
.Aggregate("", (s, c) => {
if (c.pp)
return s;
if (textEntities.ContainsKey(c.c1))
s += textEntities[c.c1];
else if (c.v)
s += c.c1.ToString();
else if (c.p)
s += c.c1.ToString() + c.c2.ToString();
return s;
});
return stripped;
}
}
This passes all the XmlTextEncoder tests except for the one that expects it to strip DEL
which XmlConvert.IsXmlChar
, Wikipedia, and the spec marks as a valid (although discouraged) character.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With