
How to get the full contents of a node using XPath and lxml?

I am using lxml's xpath function to retrieve parts of a webpage. I am trying to get the contents of a <font> tag, which includes HTML tags of its own. If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]

I get the right number of nodes, but they are returned as lxml objects (<Element font at 0x101fe5eb0>).

If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/text()

I get exactly what I want, except that I don't get any of the HTML code which is contained within the <font> nodes.

If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/node()

I get a mixture of text and lxml elements (e.g. something something <Element a at 0x102ac2140> something).

Is there any way to use a pure XPath query to get the contents of the <font> nodes, or even to force lxml to return a string of the contents from the .xpath() method, rather than an lxml object?

Note that I'm returning a list of many nodes from the XPath query so the solution needs to support that.

Just to clarify, I want to return something something <a href="url">inside</a> something from something like:

<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>
significance asked Oct 14 '22 21:10


1 Answer

I'm not sure I understand -- is this close to what you are looking for?

import io
import lxml.etree as le

content = '''\
<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>
'''
doc = le.parse(io.StringIO(content))

# child::* selects only the element children of each matching <font>;
# tostring() also emits the text that follows each element (its tail).
xpath = '//font[@face="verdana" and @color="#ffffff" and @size="2"]/child::*'
x = doc.xpath(xpath)
print([le.tostring(el, encoding='unicode') for el in x])
# ['<a href="url">inside</a> something']
unutbu answered Nov 01 '22 10:11