
Entity references and lxml

Tags: python, xml, lxml

Here's the code I have:

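from io import BytesIO
from lxml import etree

# lxml rejects str input that carries an encoding declaration, so use bytes.
xml = BytesIO(b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ENTITY test "This is a test">
]>
<root>
  <sub>&test;</sub>
</root>''')

# Default parser: entity references are expanded into text.
d1 = etree.parse(xml)
print(repr(d1.find('sub').text))

# Same document again, this time leaving entity references unresolved.
xml.seek(0)
parser = etree.XMLParser(resolve_entities=False)
d2 = etree.parse(xml, parser=parser)
print(repr(d2.find('sub').text))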

Here's the output:

'This is a test'
None

How do I get lxml to give me '&test;', i.e., the raw entity reference?

Ignacio Vazquez-Abrams asked Mar 26 '10 15:03



1 Answer

With resolve_entities=False, the unresolved entity is not dropped. It is left as a child node of the sub element, which is why sub.text comes back as None:

>>> print(d2.find('sub')[0])
&test;
>>> list(d2.find('sub'))
[&test;]
MattH answered Oct 17 '22 03:10