
Dealing with forbidden characters in XML using C# .NET

I have an object that I am serializing to XML. A value in one of its properties contains the character 0x1E. I've tried setting the Encoding property of XmlWriterSettings to both "utf-16" and "unicode", but an exception is still thrown:

There was an error generating the XML document. ---> System.InvalidOperationException: There was an error generating the XML document. ---> System.ArgumentException: '', hexadecimal value 0x1E, is an invalid character.

Is there any way to get these characters into the XML? If not, are there other characters that will cause problems?

Jeremy asked Oct 29 '09 22:10



2 Answers

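The XML Recommendation (aka the spec), http://www.w3.org/TR/2000/REC-xml-20001006, defines which characters are allowed. Anything outside the Char range below, including 0x1E, is not permitted in an XML 1.0 document, and writing it as a character reference such as &#x1E; does not make it legal either, because character references must also match Char.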


2.2 Characters

[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646] (see also [ISO/IEC 10646-2000]). Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors must accept any character in the range specified for Char. The use of "compatibility characters", as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]), is discouraged.]

Character Range

[2]     Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] |
            [#xE000-#xFFFD] | [#x10000-#x10FFFF]    
     /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.
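The Char production quoted above can be applied directly to a string to see which characters will cause problems. The following is only a minimal sketch: the class name XmlCharReport and the helpers IsXmlChar and FindIllegal are hypothetical names chosen for illustration, not part of .NET or of the spec.

```csharp
using System;
using System.Collections.Generic;

public static class XmlCharReport
{
    // True if the code point matches production [2] Char of XML 1.0.
    public static bool IsXmlChar(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns one "index: U+XXXX" entry for every code point that Char excludes.
    public static List<string> FindIllegal(string text)
    {
        var problems = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            // Combine a valid surrogate pair into a single code point;
            // a lone surrogate stays in 0xD800-0xDFFF and is reported as illegal.
            int codePoint = char.IsSurrogatePair(text, i)
                ? char.ConvertToUtf32(text, i)
                : text[i];

            if (!IsXmlChar(codePoint))
            {
                problems.Add(i + ": U+" + codePoint.ToString("X4"));
            }

            if (codePoint > 0xFFFF)
            {
                i++; // skip the second half of the pair
            }
        }
        return problems;
    }

    public static void Main()
    {
        // Prints "1: U+001E" and "3: U+0000"; the tab at index 5 is legal.
        foreach (string problem in FindIllegal("a\u001Eb\u0000c\tok"))
        {
            Console.WriteLine(problem);
        }
    }
}
```

So the answer to "are there other characters that will cause problems?" is: every control character below 0x20 except tab, line feed and carriage return, plus unpaired surrogates, 0xFFFE and 0xFFFF.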


peter.murray.rust answered Sep 26 '22 00:09



I know this is an old question, but I found a link and am posting it here; it should be useful to anyone who comes across this question. It worked for me.

http://seattlesoftware.wordpress.com/2008/09/11/hexadecimal-value-0-is-an-invalid-character/

Here is the code from that site, in case the site goes down:

// Requires: using System; using System.Text;

/// <summary>
/// Remove illegal XML characters from a string.
/// </summary>
public string SanitizeXmlString(string xml)
{
    if (xml == null)
    {
        throw new ArgumentNullException("xml");
    }

    StringBuilder buffer = new StringBuilder(xml.Length);

    for (int i = 0; i < xml.Length; i++)
    {
        char c = xml[i];

        // A valid surrogate pair encodes one code point in #x10000-#x10FFFF,
        // which XML 1.0 allows, so keep both halves together. Checking each
        // UTF-16 char on its own would strip these characters.
        if (char.IsSurrogatePair(xml, i))
        {
            buffer.Append(c).Append(xml[i + 1]);
            i++;
        }
        else if (IsLegalXmlChar(c))
        {
            buffer.Append(c);
        }
    }

    return buffer.ToString();
}

/// <summary>
/// Whether a given character (code point) is allowed by XML 1.0.
/// </summary>
public bool IsLegalXmlChar(int character)
{
    return
    (
         character == 0x9 /* == '\t' == 9   */          ||
         character == 0xA /* == '\n' == 10  */          ||
         character == 0xD /* == '\r' == 13  */          ||
        (character >= 0x20    && character <= 0xD7FF  ) ||
        (character >= 0xE000  && character <= 0xFFFD  ) ||
        (character >= 0x10000 && character <= 0x10FFFF)
    );
}
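As a usage illustration, the same filtering can be applied to a property value just before serialization. This is a rough sketch under some assumptions: the Note type and the Demo class are hypothetical, and the range check is delegated to System.Xml.XmlConvert.IsXmlChar (available from .NET Framework 4.0) rather than a hand-written method. Stripping characters loses data, so it only fits cases where the control character carries no meaning.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical type used only for this illustration.
public class Note
{
    public string Body { get; set; }
}

public static class Demo
{
    // Drops characters that XML 1.0 does not allow, keeping valid surrogate pairs.
    public static string Sanitize(string text)
    {
        var buffer = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                buffer.Append(text[i]).Append(text[i + 1]);
                i++; // skip the second half of the pair
            }
            else if (XmlConvert.IsXmlChar(text[i]))
            {
                buffer.Append(text[i]);
            }
        }
        return buffer.ToString();
    }

    public static string Serialize(Note note)
    {
        var serializer = new XmlSerializer(typeof(Note));
        using (var writer = new StringWriter())
        {
            serializer.Serialize(writer, note);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // 0x1E (record separator) sits between the two fields.
        var note = new Note { Body = "field1\u001Efield2" };

        // Clean the value before it reaches the serializer.
        note.Body = Sanitize(note.Body);
        Console.WriteLine(Serialize(note));
    }
}
```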
Mourya answered Sep 26 '22 00:09
