
Python, lxml: removing the outer tag from the output of lxml.html.tostring(el)

Tags: python, lxml

I am using the code below to get all of the HTML content of a section so that I can save it to a database:

el = doc.get_element_by_id('productDescription')
lxml.html.tostring(el)

The product description has a tag that looks like this:

<div id='productDescription'>

     <THE HTML CODE I WANT>

</div>

The code works and gives me all of the HTML, but how do I remove the outer layer, i.e. the opening <div id='productDescription'> and the closing </div>?

Tampa asked Feb 14 '12 18:02


2 Answers

You could convert each child to string individually:

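# el.text is None when there is no leading text; encoding='unicode' makes tostring return str on Python 3
text = el.text or ''
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())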

Or, in an even more hackish way (note that this modifies el in place):

el.attrib.clear()
el.tag = '|||'
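# with_tail=False keeps trailing text out of the result; encoding='unicode' returns str on Python 3
text = lxml.html.tostring(el, with_tail=False, encoding='unicode')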
assert text.startswith('<'+el.tag+'>') and text.endswith('</'+el.tag+'>')
text = text[len('<'+el.tag+'>'):-len('</'+el.tag+'>')]
jfs answered Sep 29 '22 21:09


If your productDescription div contains mixed text and element content, e.g.

<div id='productDescription'>
  the
  <b> html code </b>
  i want
</div>

you can get the content as a string by iterating over the result of xpath('node()'), which returns both the text nodes and the child elements in document order:

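s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):  # text node (use basestring on Python 2)
        s += node
    else:
        # with_tail=False: the tail text is already returned as its own text node
        s += lxml.html.tostring(node, with_tail=False, encoding='unicode')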
mykhal answered Sep 29 '22 21:09