Fix invalid XML with ampersands in Python

I am using Python to manipulate an XML file I receive from another system. That system produces invalid XML: mainly, it doesn't escape some of the ampersands (&).
So, for example, I have lines like this:

<IceCream>Ben&Jerry</IceCream>


Of course, parsing this with SAX or DOM throws an invalid token error.
For some more general background: it's a very large file (2 MB), fairly flat, and contains a lot of data in CDATA sections.

What I've tried:

  1. Writing a regex to replace only unescaped ampersands, without re-escaping &gt; and the like: &(?!\w{2,4};) . It fixed the parse error, but it also escaped ampersands inside CDATA, which then caused errors in a destination system. I can't unescape everything that's in CDATA afterwards because some of it needs to stay escaped.
  2. Using Beautiful (Stone) Soup. No luck there either. Instead of escaping loose ampersands, it created an entity (i.e. &Jerry;). Not good.
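
For reference, the first attempt can be reproduced with a short sketch. The sample string below is made up for illustration (it is not taken from the real file); it shows the regex leaving &gt; alone but also rewriting the ampersand inside the CDATA section, which is the unwanted side effect described in item 1.

```python
import re

# Regex from attempt 1: an ampersand that is NOT followed by
# 2-4 word characters and a semicolon (i.e. not something like &gt; or &amp;).
LOOSE_AMP = re.compile(r"&(?!\w{2,4};)")

# Hypothetical sample: one loose ampersand in text, one valid entity,
# and one ampersand inside a CDATA section.
sample = "<IceCream>Ben&Jerry &gt; <![CDATA[Tom&Jerry]]></IceCream>"

escaped = LOOSE_AMP.sub("&amp;", sample)
print(escaped)
# The ampersand inside CDATA is escaped too, which is the problem:
# <IceCream>Ben&amp;Jerry &gt; <![CDATA[Tom&amp;Jerry]]></IceCream>
```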

The next step would be to write my own parser using a state machine. Save me from going down that road.
It is not a complex structure (very flat, 4 layers deep at most), so perhaps a regex could be made to catch only the areas that aren't in CDATA.

Many thanks.

yulkes asked May 22 '11 15:05



1 Answer

Use the Python bindings for tidylib:

>>> import tidylib
>>> # tidy_document returns a (document, errors) tuple; [0] is the cleaned document
>>> print(tidylib.tidy_document("<IceCream>Ben&Jerry</IceCream>", {"input_xml": True})[0])
<IceCream>Ben&amp;Jerry</IceCream>

See the official tidy documentation for a list of parser options.
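
If installing tidylib is not an option, the CDATA concern from the question might also be handled with the standard library alone. The following is only a rough sketch, not a full XML repair tool: the function name escape_loose_ampersands and the sample string are made up for illustration. It matches whole CDATA sections and copies them through unchanged, and escapes an ampersand only when it does not start a named, decimal, or hexadecimal reference. It assumes ampersands never appear in comments or processing instructions, and it cannot tell a loose ampersand from an entity when the text happens to look like one (for example "Ben&Jerry;").

```python
import re

# Match either a whole CDATA section (group 1) or a loose ampersand.
# An ampersand is treated as "loose" when it does not start a named (&amp;),
# decimal (&#38;) or hexadecimal (&#x26;) reference.
TOKEN = re.compile(
    r"(<!\[CDATA\[.*?\]\]>)"
    r"|&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)",
    re.DOTALL,
)


def escape_loose_ampersands(xml_text):
    """Escape bare '&' outside CDATA sections; copy CDATA through unchanged."""
    def fix(match):
        cdata = match.group(1)
        # A CDATA match is returned as-is; otherwise the match is a loose '&'.
        return cdata if cdata is not None else "&amp;"
    return TOKEN.sub(fix, xml_text)


# Hypothetical sample with a loose ampersand, a valid entity, and CDATA.
sample = "<IceCream>Ben&Jerry &gt; <![CDATA[Tom&Jerry]]></IceCream>"
fixed = escape_loose_ampersands(sample)
print(fixed)
# <IceCream>Ben&amp;Jerry &gt; <![CDATA[Tom&Jerry]]></IceCream>
```

Because the regex scans left to right and consumes each CDATA section as a single match, nothing inside CDATA is ever examined for ampersands, which avoids the side effect of the simpler regex from the question.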

Eric Pruitt answered Oct 24 '22 05:10
