I am using the code below to get all of the HTML content of a section so I can save it to a database:
el = doc.get_element_by_id('productDescription')
lxml.html.tostring(el)
The product description has a tag that looks like this:
<div id='productDescription'>
<THE HTML CODE I WANT>
</div>
The code works great and gives me all of the HTML, but how do I remove the outer layer, i.e. the opening <div id='productDescription'> tag and the closing </div> tag?
You could convert each child to string individually:
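text = el.text or ''  # el.text is None when the div starts with a child element
# encoding='unicode' makes tostring return str instead of bytes
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())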
Or, in an even more hackish way:
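# Note: this modifies el in place (its attributes and tag name are lost)
el.attrib.clear()
el.tag = '|||'  # placeholder tag name that should not occur in the content
# with_tail=False keeps any text after </div> out of the output
text = lxml.html.tostring(el, encoding='unicode', with_tail=False)
assert text.startswith('<'+el.tag+'>') and text.endswith('</'+el.tag+'>')
text = text[len('<'+el.tag+'>'):-len('</'+el.tag+'>')]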
If your productDescription div contains mixed text/element content, e.g.
<div id='productDescription'>
the
<b> html code </b>
i want
</div>
you can get the content as a string by traversing xpath('node()'):
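s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):  # text node (use basestring on Python 2)
        s += node
    else:
        # the tail text comes back as its own text node, so leave it out here
        s += lxml.html.tostring(node, with_tail=False, encoding='unicode')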