
Unicode Regex; Invalid XML characters

The list of valid XML characters is well known; the spec defines it as:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 

My question is whether or not it's possible to make a PCRE regular expression for this (or its inverse) without actually hard-coding the codepoints, by using Unicode general categories. An inverse might be something like [\p{Cc}\p{Cs}\p{Cn}], except that improperly covers linefeeds and tabs and misses some other invalid characters.

Edward Z. Yang asked Dec 29 '08 06:12




1 Answer

I know this isn't exactly an answer to your question, but it's helpful to have it here:

Regular Expression to match valid XML Characters:

[\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD] 
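One caveat worth spelling out: .NET strings are UTF-16, so the character class above only covers the Basic Multilingual Plane. The spec's `[#x10000-#x10FFFF]` range shows up in a string as surrogate pairs, which the class does not match. The following is a minimal sketch of a whole-string check that accounts for this. The class and method names (`XmlCharCheck`, `IsValidXmlString`) are hypothetical, chosen only for illustration, and the loop is one possible approach rather than the method used in this answer.

```csharp
using System;

public static class XmlCharCheck
{
    // Hypothetical helper: returns true when every code point in text
    // falls inside the XML 1.0 Char production quoted in the question.
    public static bool IsValidXmlString(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        for (int i = 0; i < text.Length; i++)
        {
            // A well-formed high/low surrogate pair encodes a code point
            // in #x10000-#x10FFFF, all of which the spec allows.
            if (char.IsSurrogatePair(text, i))
            {
                i++; // skip the low surrogate of the pair
                continue;
            }

            char c = text[i];
            bool valid =
                c == '\t' || c == '\n' || c == '\r' ||
                (c >= '\u0020' && c <= '\uD7FF') ||
                (c >= '\uE000' && c <= '\uFFFD');

            // Lone surrogates (#xD800-#xDFFF) fall outside both ranges
            // above, so they are rejected here as well.
            if (!valid) return false;
        }
        return true;
    }
}
```

Under these assumptions, `XmlCharCheck.IsValidXmlString("a\uD83D\uDE00b")` should accept the string, since the middle two code units form a valid pair. A string containing `\u0000` or an unpaired `\uD83D` should be rejected.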

So to remove invalid chars from XML, you'd do something like

// filters control characters but allows only properly-formed surrogate sequences
private static Regex _invalidXMLChars = new Regex(
    @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
    RegexOptions.Compiled);

/// <summary>
/// removes any unusual unicode characters that can't be encoded into XML
/// </summary>
public static string RemoveInvalidXMLChars(string text)
{
    if (string.IsNullOrEmpty(text)) return "";
    return _invalidXMLChars.Replace(text, "");
}

I had our resident regex / XML genius, he of the 4,400+ upvoted post, check this, and he signed off on it.

Jeff Atwood answered Oct 03 '22 14:10