I receive XML strings from an external source that can contain unsanitized, user-contributed content.
The following XML string gave a ParseError in cElementTree:
>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    ET.XML(s)
  File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17
Is there a way to make cElementTree not complain?
It seems to complain about \x08 (the backspace character at column 17). You will need to escape or remove that.
Edit:
Or you can switch to lxml and have its parser ignore the errors by passing recover=True:
from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)
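Before choosing a fix, it can help to confirm which character the parser is actually rejecting. The sketch below uses only the standard library and maps the position reported by ParseError back to the input. The helper name find_rejected_char is hypothetical, and the column lookup assumes ASCII input (for non-ASCII text the reported column may be a byte offset rather than a character index).

```python
import xml.etree.ElementTree as ET

s = "<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>"


def find_rejected_char(xml_text):
    """Return (line, column, char) for the first parse error, or None.

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is (line, column): line is 1-based, column is 0-based.
        line, column = err.position
        bad_line = xml_text.split("\n")[line - 1]
        # Slice instead of indexing so an error at end-of-line cannot raise.
        return line, column, bad_line[column:column + 1]
    return None


result = find_rejected_char(s)
print(repr(result))
```

Printing the result with repr makes control characters such as \x08 visible, which is usually enough to tell an invalid-character problem apart from an encoding problem.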
I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)
import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)
EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...
See this answer to another question and the relevant part of the XML spec.
The backspace character U+0008 is invalid in XML documents. It cannot occur literally; XML 1.1 accepts it only as the escaped character reference &#8;, and XML 1.0 does not allow it in any form.
If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
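A minimal sketch of that replacement step, assuming the offending characters can simply be dropped: it removes every character that falls outside the XML 1.0 Char production and then parses the result with the standard library. The names _INVALID_XML_CHARS and strip_invalid_xml_chars are hypothetical, not part of any library.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character NOT allowed by the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Drop characters that an XML 1.0 parser would reject (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub("", text)


s = "<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>"
cleaned = strip_invalid_xml_chars(s)
root = ET.fromstring(cleaned)
print(repr(root.text))
```

Dropping the characters loses information. If the control characters carry meaning, substituting a visible placeholder (for example U+FFFD) in the sub() call may be a better choice.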
None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree, as follows:
from bs4 import BeautifulSoup
with open("data/myfile.xml") as fp:
    # The 'xml' feature requires lxml to be installed.
    soup = BeautifulSoup(fp, 'xml')
Then you can search the tree as:
soup.find_all('mytag')