I'm looking for the standard, approved, and robust way of stripping invalid characters from strings before writing them to an XML file. I'm talking here about blocks of text containing backspace (^H), form feed characters, and so on.
There has to be a standard library/module function for doing this but I can't find it.
I'm using XML::LibXML to build a DOM tree that I then serialize to disk.
If you're unable to identify the offending character visually, you can use a text editor such as TextPad to view your source file. Within the application, use the Find function, select "hex", and search for the character mentioned. Removing these characters from your source file resolves the invalid XML character issue.
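If you'd rather locate the characters programmatically than hunt for them in a hex view, something along these lines should work. It is only a sketch: `find_invalid_xml_chars` is a made-up helper name, and it assumes the string has already been decoded into Perl characters rather than raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, code point] pairs for every
# character that the XML 1.0 Char production does not allow.
sub find_invalid_xml_chars {
    my ($str) = @_;
    my @hits;
    while ($str =~ /([^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/g) {
        # pos() points just past the match, so step back one character
        push @hits, [ pos($str) - 1, ord($1) ];
    }
    return @hits;
}

# Report each offending character as a zero-based offset and a code point
for my $hit (find_invalid_xml_chars("ab\x08cd\x0Cef")) {
    printf "offset %d: U+%04X\n", @$hit;
}
```

For the sample string this reports the backspace at offset 2 and the form feed at offset 5.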
Note that the ampersand (&) and less-than (<) characters are not permitted in XML attribute values. Since XFDL computes appear in a compute attribute, these must be escaped with character or entity references (e.g. the entity references &amp; for the ampersand and &lt; for the less-than character).
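If you are assembling attribute values by hand, the escaping can be sketched like this. `escape_attr` is a hypothetical helper, not part of any module, and it assumes the value will sit inside double quotes. If you build the document through a DOM library such as XML::LibXML, the serializer should normally take care of this for you, so treat this as an illustration for hand-built markup only.

```perl
use strict;
use warnings;

# Hypothetical helper: escape text for use inside a double-quoted
# XML attribute value.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;    # must run first, or the entities added below get re-escaped
    $value =~ s/</&lt;/g;
    $value =~ s/"/&quot;/g;   # only needed because the value is assumed to be double-quoted
    return $value;
}

print escape_attr('a < b && c'), "\n";   # prints: a &lt; b &amp;&amp; c
```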
The complete regex for removal of invalid XML 1.0 characters is:
# allowed: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
For XML 1.1 it is:
# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
# restricted: [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
$str =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//g;
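To apply the XML 1.0 substitution to every string before it goes into the DOM, it can be wrapped in a small sub. This is a minimal sketch: `strip_invalid_xml10` is a made-up name, and it assumes the input is a decoded character string, not raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the string with every character
# outside the XML 1.0 Char production removed.
sub strip_invalid_xml10 {
    my ($str) = @_;
    $str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

my $dirty = "page one\x0Cpage\x08 two\n";
my $clean = strip_invalid_xml10($dirty);
# $clean is "page onepage two\n": the form feed and backspace are gone,
# the newline is kept
```

Returning a copy rather than modifying the argument in place keeps the original text available in case you need to log what was removed.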
As almost everyone else has said, use a regular expression. It's honestly not complex enough to be worth adding to a library. Preprocess your text with a substitution.
Your comment about linefeeds above suggests that the formatting is of some importance to you so you will possibly have to decide exactly what you want to replace some characters with.
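If the layout matters, replacing instead of deleting might look like the sketch below. The policy is purely an assumption for illustration (form feed becomes a newline, every other disallowed control character becomes a space), and `replace_invalid_controls` is a hypothetical name. Pick whatever mapping suits your data.

```perl
use strict;
use warnings;

# Assumed policy, for illustration only:
#   form feed (\x0C)                  -> newline
#   other disallowed ASCII controls   -> single space
sub replace_invalid_controls {
    my ($text) = @_;
    $text =~ s/\x0C/\n/g;
    $text =~ s/[\x00-\x08\x0B\x0E-\x1F]/ /g;
    return $text;
}

my $fixed = replace_invalid_controls("first page\x0Csecond\x08page");
# $fixed is "first page\nsecond page"
```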
The list of invalid characters is clearly defined in the XML spec (see http://www.w3.org/TR/REC-xml/#charsets, for example). In practice, the disallowed characters are the ASCII control characters bar carriage return, linefeed and tab; the spec also excludes the surrogate block and U+FFFE/U+FFFF, but those rarely turn up in ordinary text. So, you are looking at a 29-character regular expression character class. That's not too bad, surely.
Something like:
$text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;
should do it.