I'm trying to parse HTML with lxml in Python, and this is my output:
<td>
<span style="display:inline">text1</span>
<span style="display:none">text2</span>
<span>text3</span>
text4
</td>
I thought I was smart enough to use the following:
tree = tr.xpath("//*[contains(@style,'inline')]/text()")
But then I realized that would only give me text1. I want to see text3 and text4 too, so that the output is:
['text1', 'text3', 'text4']
Can anyone point me in the right direction?
/* selects the root element, regardless of name. ./* or * selects all child elements of the context node, regardless of name.
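As a small illustration of the difference, here is a sketch using lxml on a made-up two-span fragment (the fragment and variable names are assumptions for the example, not part of the question):

```python
from lxml import etree

# Hypothetical fragment, used only to show how the three expressions differ.
root = etree.fromstring("<td><span>a</span><span>b</span></td>")
first_span = root[0]

# "/*" is absolute: it selects the document's root element,
# even when evaluated from a nested element.
top = first_span.xpath("/*")

# "*" and "./*" are relative: both select the children of the context node.
children = root.xpath("*")
children_dot = root.xpath("./*")

print([e.tag for e in top])
print([e.tag for e in children])
```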
text() is not a Selenium WebDriver function; it is a node test defined by XPath itself, and it selects the text nodes that are children of the context node. It is often used inside a predicate to locate an element by its exact text, for example //span[text()='text3'].
There is no one-step way to shorten or simplify an XPath expression. The real challenge is converting an absolute XPath into a relative one.
Explicitly exclude anything with display:none:
tree = tr.xpath("//*[not(contains(@style,'display:none'))]/text()")
That said -- this is only a distant approximation of what a browser would actually do; you'd want to be driving an actual browser (as with Selenium, embedding APIs, or the like) if you required strictly accurate results.
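For reference, here is a minimal runnable sketch of that expression applied to the markup from the question. The wrapping table/tr tags, the name doc, and the helper visible_text are assumptions added for the example. The whitespace filtering is also an addition, since text() returns the newline-only nodes between the spans as well.

```python
from lxml import html

# Markup from the question, wrapped in <table><tr> (an assumption) so the
# HTML parser keeps the <td> element.
SNIPPET = """
<table><tr><td>
<span style="display:inline">text1</span>
<span style="display:none">text2</span>
<span>text3</span>
text4
</td></tr></table>
"""

def visible_text(doc):
    """Return non-blank text nodes whose parent has no inline display:none."""
    nodes = doc.xpath("//*[not(contains(@style,'display:none'))]/text()")
    # Drop whitespace-only nodes and trim the rest.
    return [t.strip() for t in nodes if t.strip()]

doc = html.fromstring(SNIPPET)
print(visible_text(doc))
```

For the question's markup this yields text1, text3 and text4. Note that it only checks each text node's direct parent and only matches the exact substring display:none, so text nested under a hidden ancestor, styles written as "display: none", and rules coming from a stylesheet are not handled.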