I'd like to print out the tree structure of an etree (built from an HTML document) in a distinguishable way, meaning that two etrees with different structures should print out differently.
By structure I mean the "shape" of the tree: all the tags, but no attributes and no text content.
Any ideas? Is there something in lxml to do that?
If not, I guess I have to iterate through the whole tree and build a string from it. Any ideas on how to represent the tree compactly? (Compactness is less important.)
FYI, the output is not meant to be read by a person; it will be stored and hashed so that I can tell several HTML templates apart.
Thanks
Introduction. The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.
etree only returns real Elements, i.e. tree nodes that have a string tag name. Without a filter, both libraries iterate over all nodes. Note that currently only lxml.etree supports passing the Element factory function as filter to select only Elements.
There is a lot of documentation on the web and also in the Python standard library documentation, as lxml implements the well-known ElementTree API and tries to follow its documentation as closely as possible. The recipes in Fredrik Lundh's element library are generally worth taking a look at.
Maybe just run some XSLT over the source XML to strip everything but the tags. It's then easy enough to use etree.tostring to get a string you could hash...
from lxml import etree as ET

def pp(e):
    # Pretty-print an element, followed by a blank line
    print(ET.tostring(e, pretty_print=True, encoding="unicode"))
    print()
root = ET.XML("""\
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8" />
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
<boolean id="import_live">0</boolean>
</preference-set>
</project>
""")
pp(root)
xslt = ET.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
""")
tr = ET.XSLT(xslt)      # compile the stylesheet into a callable transform
doc2 = tr(root)         # apply it; only the elements are copied
root2 = doc2.getroot()
pp(root2)
Gives you the output:
<project id="8dce5d94-4273-47ef-8d1b-0c7882f91caa" kpf_version="4">
<livefolder id="8744bc67-1b9e-443d-ba9f-96e1d0007ba8" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8">Mooo</livefolder>
<livefolder id="8744bc67-1b9e-443d-ba9f" idref="707cd68a-33b5-4051-9e40-8ba686c2fdb8"/>
<preference-set idref="8dce5d94-4273-47ef-8d1b-0c7882f91caa">
<boolean id="import_live">0</boolean>
</preference-set>
</project>
<project>
  <livefolder/>
  <livefolder/>
  <preference-set>
    <boolean/>
  </preference-set>
</project>
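If you would rather skip XSLT and walk the tree yourself, as the question suggests, here is a minimal sketch of one possible compact representation plus a hash of it. The names shape and shape_hash and the nested "tag(child,child)" notation are made up for this example; they are not part of lxml. It uses the standard library's xml.etree.ElementTree so it runs without extra installs, on the assumption that lxml elements can be passed in unchanged since they expose the same tag and child-iteration interface.

```python
import hashlib
import xml.etree.ElementTree as ET  # assumption: lxml.etree elements behave the same here


def shape(elem):
    """Return a compact string describing the tag structure below elem."""
    # Keep only real elements: comments and processing instructions
    # are nodes whose tag is not a string.
    children = [shape(c) for c in elem if isinstance(c.tag, str)]
    if not children:
        return elem.tag
    return "%s(%s)" % (elem.tag, ",".join(children))


def shape_hash(elem):
    """Hash the shape string so that templates can be compared cheaply."""
    return hashlib.sha256(shape(elem).encode("utf-8")).hexdigest()


# Same structure, different attributes and text
doc_a = ET.fromstring('<project id="1"><livefolder>Mooo</livefolder><livefolder/>'
                      '<preference-set><boolean>0</boolean></preference-set></project>')
doc_b = ET.fromstring('<project><livefolder/><livefolder/>'
                      '<preference-set><boolean/></preference-set></project>')
# Different structure
doc_c = ET.fromstring('<project><livefolder/><preference-set/></project>')

print(shape(doc_a))  # project(livefolder,livefolder,preference-set(boolean))
print(shape_hash(doc_a) == shape_hash(doc_b))  # True
print(shape_hash(doc_a) == shape_hash(doc_c))  # False
```

Parentheses and commas cannot appear in tag names, so two different shapes should not collapse into the same string. The function is recursive, so a very deeply nested document could run into Python's recursion limit.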