The list of valid XML characters is well known; the spec defines it as:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
My question is whether or not it's possible to make a PCRE regular expression for this (or its inverse) without actually hard-coding the codepoints, by using Unicode general categories. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that improperly covers linefeeds and tabs and misses some other invalid characters.
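To see where a category-based class and the spec's ranges disagree, here is a minimal Python sketch. The helper names `is_xml_char` and `in_category_class` are made up for illustration, and Python's `unicodedata` tables may be a different Unicode version from the one your PCRE build uses, so treat the category results as indicative only.

```python
import unicodedata

def is_xml_char(cp: int) -> bool:
    # Hard-coded ranges from the XML 1.0 Char production quoted above.
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def in_category_class(cp: int) -> bool:
    # Approximates what a class of the categories Cc, Cs and Cn would match.
    return unicodedata.category(chr(cp)) in ("Cc", "Cs", "Cn")

# A few probe code points: tab, NUL, DEL, a C1 control, a surrogate,
# a noncharacter, and the letter A.
for cp in (0x09, 0x00, 0x7F, 0x85, 0xD800, 0xFFFE, 0x41):
    print(f"U+{cp:04X} xml_valid={is_xml_char(cp)} in_class={in_category_class(cp)}")
```

In this sketch the tab, DEL and U+0085 are valid XML characters that still fall inside the category class, so the class cannot be used as the inverse without carving out exceptions.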
The only illegal characters are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed'). They're escaped using XML entities; in this case you want &amp; for &.
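As a rough illustration of that escaping, Python's standard library has helpers for both text content and attribute values. This is only a sketch; the sample strings are invented.

```python
from xml.sax.saxutils import escape, quoteattr

# Text content: &, < and > are replaced with entities.
text = escape("Fish & Chips <b>")
print(text)

# Attribute values: quoteattr picks a delimiter and escapes only what it must.
double_inside = quoteattr('say "hi"')       # switches to single-quote delimiters
single_inside = quoteattr("it's")           # keeps double-quote delimiters
both_inside = quoteattr('both " and \'')    # has to escape the double quote
print(double_inside, single_inside, both_inside)
```

Note that these helpers only escape markup characters; they do not remove code points that are outside the valid XML character range.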
If you're unable to identify this character visually, you can use a text editor such as TextPad to view your source file. Within the application, use the Find function, select "hex", and search for the character mentioned. Removing these characters from your source file resolves the invalid XML character issue.
This will make your regular expressions work with all Unicode regex engines. In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.
Open an XML document in the text editing mode, right click inside it and there is a new menu item "Determine Complex Layout Chars".
I know this isn't exactly an answer to your question, but it's helpful to have it here:
Regular Expression to match valid XML Characters:
[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]
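The class above stops at U+FFFD, so it leaves out the [#x10000-#x10FFFF] range from the spec. In an engine that works on whole code points, that range can be added directly. A minimal Python sketch, assuming the standard `re` module (the names `VALID_XML_TEXT` and `is_valid_xml_text` are illustrative only):

```python
import re

# Every character must be in the XML 1.0 Char production,
# including the supplementary planes U+10000 to U+10FFFF.
VALID_XML_TEXT = re.compile(
    r"[\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)

def is_valid_xml_text(text: str) -> bool:
    # fullmatch succeeds only if the whole string consists of valid characters.
    return VALID_XML_TEXT.fullmatch(text) is not None

print(is_valid_xml_text("plain text\twith a tab"))
print(is_valid_xml_text("form feed \x0c is not allowed"))
```

Engines that operate on UTF-16 code units, such as .NET, see supplementary characters as surrogate pairs instead, so there the check has to be written in terms of well-formed pairs.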
So to remove invalid chars from XML, you'd do something like
// filters control characters but allows only properly-formed surrogate sequences
// (note: this also strips U+007F-U+009F and U+FEFF, which XML 1.0 itself permits)
private static Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);

/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
    if (string.IsNullOrEmpty(text)) return "";
    return _invalidXMLChars.Replace(text, "");
}
I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.