ParseError: not well-formed (invalid token) using cElementTree

Question

I receive xml strings from an external source that can contains unsanitized user contributed content.

The following xml string gave a ParseError in cElementTree:

>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    ET.XML(s)
  File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17

Is there a way to make cElementTree not complain?

iabdalkader · Accepted Answer

It seems to complain about \x08 you will need to escape that.

Edit:

Or you can have the parser ignore the errors using recover

from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)

juan · Answer

I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)

import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)

EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...

Boldewyn · Answer

See this answer to another question and the according part of the XML spec.

The backspace U+0008 is an invalid character in XML documents. It must be represented as escaped entity  and cannot occur plainly.

If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.

tsando · Answer

None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:

from bs4 import BeautifulSoup

with open("data/myfile.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')

Then you can search the tree as:

soup.find_all('mytag')

ParseError: not well-formed (invalid token) using cElementTree

Tags:

python

parsing

elementtree

BioGeek

4 Answers

iabdalkader

juan

Boldewyn

tsando

Recent Activity

Donate For Us

ParseError: not well-formed (invalid token) using cElementTree

Tags:

python

parsing

elementtree

BioGeek

4 Answers

iabdalkader

juan

Boldewyn

tsando

Related questions

Recent Activity

Donate For Us