I am using lxml's xpath function to retrieve parts of a webpage. I am trying to get the contents of a <font> tag, which includes HTML tags of its own. If I use
//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]
I get the right number of nodes, but they are returned as lxml objects (<Element font at 0x101fe5eb0>).
If I use
//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/text()
I get exactly what I want, except that I don't get any of the HTML code contained within the <font> nodes.
If I use
//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/node()
I get a mixture of text and lxml elements! (e.g. something something <Element a at 0x102ac2140> something)
Is there any way to use a pure XPath query to get the contents of the <font> nodes, or even to force lxml to return a string of the contents from the .xpath() method, rather than an lxml object?
Note that I'm returning a list of many nodes from the XPath query so the solution needs to support that.
Just to clarify, I want to return something something <a href="url">inside</a> something
from something like:
<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>
I'm not sure I understand -- is this close to what you are looking for?
import io
import lxml.etree as le

content = '''\
<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>
'''
doc = le.parse(io.StringIO(content))
# Select the element children of each matching <font> node
xpath = '//font[@face="verdana" and @color="#ffffff" and @size="2"]/child::*'
x = doc.xpath(xpath)
# tostring() also emits each element's tail text; encoding="unicode" returns str, not bytes
print([le.tostring(e, encoding="unicode") for e in x])
# ['<a href="url">inside</a> something']