
What are invalid characters in XML

Tags:

xml


OK, let's separate the question of the characters that:

  1. aren't valid at all in any XML document.
  2. need to be escaped.

The answer provided by @dolmen in "https://stackoverflow.com/questions/730133/invalid-characters-in-xml/5110103#5110103" is still valid but needs to be updated with the XML 1.1 specification.

1. Invalid characters

The grammar productions below list every character that is allowed in an XML document; any character outside them is invalid.

1.1. In XML 1.0

  • Reference: see XML recommendation 1.0, §2.2 Characters

The global list of allowed characters is:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

In short, control characters other than tab (#x9), line feed (#xA), and carriage return (#xD) are not allowed, nor are code points outside the ranges above. This also rules out character references to such code points: for example, &#x3; is forbidden.
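The Char production above can be checked mechanically. The following is a minimal Python sketch, not taken from the original answer: the helper names is_valid_xml10_char and parses are hypothetical, and the second helper only reports what Python's standard-library (expat-based) parser does with a given input.

```python
import xml.etree.ElementTree as ET


def is_valid_xml10_char(ch: str) -> bool:
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # the only C0 controls allowed
        or 0x20 <= cp <= 0xD7FF        # BMP below the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD      # BMP above the surrogates, minus FFFE/FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def parses(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

With these helpers, is_valid_xml10_char("\x03") is False, and parses("<a>&#x3;</a>") is False as well, which illustrates that a character reference does not make a forbidden code point acceptable.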

1.2. In XML 1.1

  • Reference: see XML recommendation 1.1, §2.2 Characters, and 1.3 Rationale and list of changes for XML 1.1

The global list of allowed characters is:

[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

This revision of the XML recommendation extends the set of allowed characters to include most control characters, and takes a newer revision of the Unicode standard into account. The characters matched by RestrictedChar may only appear as character references (for example &#x1;), not literally. Some code points are still not allowed at all: NUL (#x0), the surrogate blocks, #xFFFE and #xFFFF.

However, the use of control characters and permanently undefined Unicode characters is discouraged.

Note also that not all parsers support XML 1.1, so XML documents containing control characters may be rejected.
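The two XML 1.1 productions can be combined into a small classifier. This is an illustrative Python sketch: the function names and the three labels ("forbidden", "reference-only", "literal") are made up for this example. It assumes the XML 1.1 rule that characters matching RestrictedChar may appear only as character references.

```python
def is_xml11_char(cp: int) -> bool:
    """XML 1.1 production [2] Char."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(cp: int) -> bool:
    """XML 1.1 production [2a] RestrictedChar."""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def classify_xml11(cp: int) -> str:
    """Classify a code point for use in an XML 1.1 document."""
    if not is_xml11_char(cp):
        return "forbidden"        # e.g. NUL, surrogates, FFFE, FFFF
    if is_xml11_restricted(cp):
        return "reference-only"   # must be written as &#x...;
    return "literal"              # may appear as-is
```

For instance, classify_xml11(0x3) gives "reference-only", whereas the same code point is invalid in any form under XML 1.0.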

2. Characters that need to be escaped (to obtain a well-formed document):

The < must be escaped as &lt; (or the character reference &#60;), since it is taken to be the beginning of a tag.

The & must be escaped as &amp; (or &#38;), since it is taken to be the beginning of an entity reference.

The > should be escaped as &gt; (or &#62;). This is mandatory only when it follows ]] in text content, but it is strongly advised to always escape it.

The ' should be escaped as &apos; (or &#39;). This is mandatory in attribute values delimited by single quotes, but it is strongly advised to always escape it.

The " should be escaped as &quot; (or &#34;). This is mandatory in attribute values delimited by double quotes, but it is strongly advised to always escape it.
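As a hedged illustration of these rules, Python's standard library ships xml.sax.saxutils.escape and quoteattr. The wrapper name escape_all below is hypothetical; it simply asks escape to also replace both quote characters, following the "always escape" advice above.

```python
from xml.sax.saxutils import escape, quoteattr


def escape_all(text: str) -> str:
    """Escape &, <, > and both quote characters using the named entities."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


# escape() alone handles the three characters relevant to text content.
text_content = escape("a < b & c > d")

# quoteattr() returns the value already wrapped in suitable quotes,
# escaping a quote character only when the value contains both kinds.
attr_value = quoteattr('say "hi"')
```

Note that escaping only deals with the five special characters; it does not remove characters that are invalid in XML altogether, such as most control characters.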


The list of valid characters is in the XML specification:

Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Of these valid characters, the ones that cannot always be written literally are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed'). Strictly, & and < must always be escaped, while > must be escaped only when it follows ]] in text.

They're escaped using XML entities; in this case you want &amp; for &.

Really, though, you should use a tool or library that writes XML for you and abstracts this away, so you don't have to worry about it.
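To illustrate the "let a library do it" advice, here is a small Python sketch using the standard xml.etree.ElementTree module. The function name build_note and the element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET


def build_note(body: str, title: str) -> str:
    """Serialize a <note> element; the library escapes text and attributes."""
    note = ET.Element("note", {"title": title})
    note.text = body
    return ET.tostring(note, encoding="unicode")


serialized = build_note("1 < 2 & 3 > 2", 'say "hi"')

# Parsing the result gives back the original, unescaped strings.
parsed = ET.fromstring(serialized)
```

The caller never writes &lt; or &amp; by hand: the special characters are escaped on output and restored on parsing.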


This C# code removes the characters that are invalid in XML from a string and returns a new, valid string.

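using System.Text.RegularExpressions;

public static string CleanInvalidXmlChars(string text)
{
    // Valid chars per the XML 1.0 spec:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // .NET strings are UTF-16 and \u takes exactly four hex digits, so
    // [#x10000-#x10FFFF] cannot be written as \u10000-\u10FFFF. Instead, match
    // disallowed BMP code units and unpaired surrogates, which leaves valid
    // surrogate pairs (supplementary characters) intact.
    string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}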
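For comparison, a rough Python counterpart of the function above is sketched here; the name clean_invalid_xml_chars is hypothetical. Python strings are sequences of code points rather than UTF-16 code units, so the supplementary range #x10000-#x10FFFF can be written directly in the character class with the eight-digit \U escape.

```python
import re

# Negated character class built from the XML 1.0 Char production:
# anything NOT in the allowed ranges is matched and removed.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_invalid_xml_chars(text: str) -> str:
    """Return text with every character that is invalid in XML 1.0 removed."""
    return _INVALID_XML10.sub("", text)
```

Silently dropping characters may hide data problems, so in some pipelines it may be preferable to log or reject the input instead.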

The characters that have predefined entities (&amp;, &lt;, &gt;, &quot; and &apos;) are:

& < > " '

See "What are the special characters in XML?" for more information.
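A quick way to see that these five entities are built in, while other names are not, is to parse them without any DTD. This is a sketch using Python's standard xml.etree.ElementTree; the helper name entity_text is made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def entity_text(entity_name: str) -> Optional[str]:
    """Return the character an entity expands to, or None if the parser rejects it."""
    try:
        return ET.fromstring(f"<a>&{entity_name};</a>").text
    except ET.ParseError:
        return None


# The five predefined entities resolve without any declaration.
predefined = {name: entity_text(name) for name in ("amp", "lt", "gt", "quot", "apos")}
```

An HTML-style name such as nbsp is not predefined in XML, so entity_text("nbsp") returns None with this parser unless the document declares the entity itself.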