Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python gives 'Not well-formed xml' error because of presence of '&' characters

I am reading an xml file using Python. But my xml file contains & characters, because of which while running my Python code, it gives the following error:

xml.parsers.expat.ExpatError: not well-formed (invalid token):

Is there a way to ignore the & check by python?

like image 675
SyncMaster Avatar asked Nov 14 '11 00:11

SyncMaster


People also ask

What causes XML error?

The most common cause is encoding errors. There are several basic approaches to solving this: escaping problematic characters ( < becomes &lt; , & becomes &amp; , etc.), escaping entire blocks of text with CDATA sections, or putting an encoding declaration at the start of the feed.

What is XML parsing error not well-formed?

"XML Parsing Error" occurs when something is trying to read the XML, not when it is being generated. Also, "not well-formed" usually refers to errors in the structure of the document, such as a missing end-tag, not the characters it contains.

How does Python handle XML?

To read an XML file using ElementTree, firstly, we import the ElementTree class found inside xml library, under the name ET (common convension). Then passed the filename of the xml file to the ElementTree. parse() method, to enable parsing of our xml file. Then got the root (parent tag) of our xml file using getroot().


2 Answers

No, you can't ignore the check. Your 'xml file' is not an XML file - to be an XML file, the ampersand would have to be escaped. Therefore, no software that is designed to read XML files will parse it without error. You need to correct the software that generated this file so that it generates proper ("well-formed") XML. All the benefits of using XML for interchange disappear entirely if people start sending stuff that isn't well-formed and people receiving it try to patch it up.

like image 50
Michael Kay Avatar answered Sep 28 '22 02:09

Michael Kay


For me adding the line "<?xml version='1.0' encoding='iso-8859-1'?>" in front the string is did the trick.

>>> text = '''<?xml version="1.0" encoding="iso-8859-1"?>
    ... <seuss><fish>red</fish><fish>blu\xe9</fish></seuss>'''
>>> doc = elementtree.ElementTree.fromstring(text)

Refer this page https://mail.python.org/pipermail/tutor/2006-November/050757.html

like image 43
Kumar Avatar answered Sep 28 '22 01:09

Kumar