How can I generate HTML text without a root tag (usually <html></html>)? For example, for use in CDATA:
<![CDATA[<div class="foo"></div><p>bar</p>]]>
My code:
from lxml import etree
html = etree.Element('root')
etree.SubElement(html, 'div', attrib={'class':'foo'})
etree.SubElement(html, 'p').text='bar'
t = etree.tostring(html)
# '<root><div class="foo"/><p>bar</p></root>'
I would rather not use a regex to remove the root tag.
If you need the text representation of all subelements without the root element, you can do:
subels = ''.join([etree.tostring(el).decode('ascii') for el in html])
where html is the Element from your question. In this case subels is a string:
'<div class="foo"/><p>bar</p>'
This can be further improved to get only specific tags using the iter method. For example:
subels = ''.join([etree.tostring(el).decode('ascii') for el in html.iter('div', 'p')])
will return only the 'div' and 'p' tags, so if there had been other tags they would have been omitted.
You can use it to filter out unwanted tags, but be careful because it may break the document hierarchy: iter walks all descendants, so it still returns matching tags that are children of undesired tags.
If the root tag has a text attribute which you want to keep, just add it back (html.text is None when the root has no leading text, hence the or ''):
subels = ''.join([html.text or ''] + [etree.tostring(el).decode('ascii') for el in html])