Consider the following:
from lxml import etree
from StringIO import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<aa> â</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
This would fail with:lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11
This is because resolve_entities=False
doesn't ignore them, it just doesn't resolve them.
If I use etree.HTMLParser
instead, it creates html
and body
tags, plus a lot of other special handling it tries to do for HTML
.
What's the best way to get a â
text child under the aa
tag with lxml?
You can't ignore entities as they are part of the XML definition. Your document is not well-formed if it doesn't have a DTD or standalone="yes" or if it includes entities without an entity definition in the DTD. Lie and claim your document is HTML.
https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html
You can try lying and putting an XHTML DTD on your document. e.g.
from lxml import etree
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >\n<aa> â</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
etree.tostring(r) # '<aa> â</aa>'
@Alex is right: your document is not well-formed XML, and so XML parsers will not parse it. One option is to pre-process the text of the document to replace bogus entities with their utf-8 characters:
entities = [
(' ', u'\u00a0'),
('â', u'\u00e2'),
...
]
for before, after in entities:
x = x.replace(before, after.encode('utf8'))
Of course, this can be broken by sufficiently weird "xml" also.
Your best bet is to fix your input XML documents to be well-formed XML.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With