Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ParseError: not well-formed (invalid token) using cElementTree

I receive xml strings from an external source that can contains unsanitized user contributed content.

The following xml string gave a ParseError in cElementTree:

>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    ET.XML(s)
  File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17

Is there a way to make cElementTree not complain?

like image 631
BioGeek Avatar asked Oct 24 '12 09:10

BioGeek


4 Answers

It seems to complain about \x08 you will need to escape that.

Edit:

Or you can have the parser ignore the errors using recover

from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)
like image 146
iabdalkader Avatar answered Oct 25 '22 16:10

iabdalkader


I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)

import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)

EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...

like image 42
juan Avatar answered Oct 25 '22 15:10

juan


See this answer to another question and the according part of the XML spec.

The backspace U+0008 is an invalid character in XML documents. It must be represented as escaped entity &#8; and cannot occur plainly.

If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.

like image 7
Boldewyn Avatar answered Oct 25 '22 16:10

Boldewyn


None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:

from bs4 import BeautifulSoup

with open("data/myfile.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')

Then you can search the tree as:

soup.find_all('mytag')
like image 7
tsando Avatar answered Oct 25 '22 15:10

tsando