I want to get rid of all invalid characters (for example, the hexadecimal value 0x1A)
from an XML file using sed.
What are the regex and the command line?
EDIT
Added Perl tag hoping to get more responses. I prefer a one-liner solution.
EDIT
These are the valid XML characters:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Have you tried SecurityElement.Escape(yourstring)? It replaces the characters in a string that are invalid in XML content (<, >, &, and the quote characters) with their escaped equivalents.
If you're unable to identify this character visually, you can use a text editor such as TextPad to view your source file. Within the application, use the Find function, select "hex", and search for the character mentioned. Removing these characters from your source file resolves the invalid XML character issue.
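If no GUI editor is at hand, the same find-then-remove step can be sketched on the command line. This is only a minimal sketch: the helper names find_ctrl_z and strip_ctrl_z and the file sample.xml are made up for illustration, and it handles the single byte 0x1A rather than every invalid character.

```shell
# Hypothetical helper names; sample.xml is a throwaway demo file.
ctrl_z=$(printf '\032')          # the literal 0x1A byte (octal 032)

# Print the line numbers of the lines that contain the byte.
# LC_ALL=C makes grep compare raw bytes regardless of the locale.
find_ctrl_z() {
    LC_ALL=C grep -n "$ctrl_z" "$1" | cut -d: -f1
}

# Write a copy of the input to stdout with every 0x1A byte deleted.
strip_ctrl_z() {
    LC_ALL=C sed "s/$ctrl_z//g" "$1"
}

printf '<a>ok</a>\n<b>bad\032</b>\n' > sample.xml
find_ctrl_z sample.xml                      # prints 2
strip_ctrl_z sample.xml > sample_fixed.xml
```

Embedding the literal byte through printf avoids relying on sed's \x escapes, which not every sed implementation supports.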
Some special characters are not permitted in XML attribute values. In particular, a literal ampersand (&) or less-than (<) character must be written as &amp; or &lt; instead.
Assuming UTF-8 XML documents:
perl -CSDA -pe'
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > file_fixed.xml
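If you want to encode the bad characters as numeric character references instead (note that XML 1.0 also rejects references to characters outside the valid ranges, so a strict parser may still refuse the result):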
perl -CSDA -pe'
s/([^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/
"&#".ord($1).";"
/xeg;
' file.xml > file_fixed.xml
You can call it a few different ways:
perl -CSDA -pe'...' file.xml > file_fixed.xml
perl -CSDA -i~ -pe'...' file.xml # In place, with backup
perl -CSDA -i -pe'...' file.xml # In place, without backup
The tr command would be simpler. So, try something like:
cat <filename> | tr -d '\032' > <newfilename>
Note that the ASCII character 0x1A has the octal value 032, so we use that with tr instead. Not sure if tr likes hex.
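The same tr approach can be extended from the single byte 032 to every control character below 0x20 that XML 1.0 disallows. The sketch below is an assumption-laden illustration, not a full validator: strip_xml_ctrl is a hypothetical helper name, and it deliberately ignores the higher invalid code points (surrogates, U+FFFE, U+FFFF), which need a Unicode-aware tool.

```shell
# Hypothetical helper: delete the C0 control characters that XML 1.0
# disallows, i.e. everything below 0x20 except tab (011), LF (012)
# and CR (015). The octal ranges cover 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F.
strip_xml_ctrl() {
    tr -d '\000-\010\013\014\016-\037'
}

# The 0x1A and 0x01 bytes are dropped; the tab and the newline survive.
printf 'a\032b\001\tc\n' | strip_xml_ctrl
```

Because bytes below 0x20 never occur inside a multi-byte UTF-8 sequence, deleting them byte by byte should not corrupt UTF-8 text.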