Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove invalid characters from an xml file using sed or Perl

Tags:

regex

sed

perl

I want to get rid of all invalid characters; example hexadecimal value 0x1A from an XML file using sed.
What is the regex and the command line?
EDIT
Added Perl tag hoping to get more responses. I prefer a one-liner solution.
EDIT
These are the valid XML characters

x9 | xA | xD | [x20-xD7FF] | [xE000-xFFFD] | [x10000-x10FFFF]
like image 882
user841550 Avatar asked Oct 14 '11 19:10

user841550


People also ask

How do you escape an invalid character in XML?

SecurityElement. Escape(yourstring) ? This will replace invalid XML characters in a string with their valid equivalent.

How do I find invalid characters in XML?

If you're unable to identify this character visually, then you can use a text editor such as TextPad to view your source file. Within the application, use the Find function and select "hex" and search for the character mentioned. Removing these characters from your source file resolve the invalid XML character issue.

Is an XML character illegal?

Some special characters are not permitted in XML attribute values. Note that the ampersand (&) and less-than (<) characters are not permitted in XML attribute values.


2 Answers

Assuming UTF-8 XML documents:

perl -CSDA -pe'
   s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > file_fixed.xml

If you want to encode the bad bytes instead,

perl -CSDA -pe'
   s/([^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/
      "&#".ord($1).";"
   /xeg;
' file.xml > file_fixed.xml

You can call it a few different ways:

perl -CSDA     -pe'...' file.xml > file_fixed.xml
perl -CSDA -i~ -pe'...' file.xml     # Inplace with backup
perl -CSDA -i  -pe'...' file.xml     # Inplace without backup
like image 74
ikegami Avatar answered Nov 15 '22 14:11

ikegami


The tr command would be simpler. So, try something like:

cat <filename> | tr -d '\032' > <newfilename>

Note that ascii character '0x1a' has the octal value '032', so we use that instead with tr. Not sure if tr likes hex.

like image 20
George Avatar answered Nov 15 '22 14:11

George