Goal: extract the text of a particular element (e.g. li) while ignoring any tags mixed into it, i.e. flatten each matched element and return its concatenated text, one string per element.
Example:
<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
</ol>
</div>
Desired text:
Central Intelligence Agency.
Culinary Institute of America.
The anchor tags nested inside each li prevent a simple retrieval, however.
To return each li tag separately, we use the straightforward:
//div[contains(@id,"mw-content-text")]/ol/li
but that also includes the nested anchor tags, etc. And
//div[contains(@id,"mw-content-text")]/ol/li/text()
returns only the text nodes that are direct children of li, i.e. 'Central ', '.', ...
It seemed logical then to look for text nodes of self and descendants:
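To make the behaviour of li/text() concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree rather than Scrapy (an assumption, chosen so it runs without extra packages; it relies on the sample parsing as well-formed XML). In ElementTree the direct text children of an element are its .text plus the .tail of each child, which mirrors what li/text() selects. The names SAMPLE and direct_text_nodes are illustrative only.

```python
import xml.etree.ElementTree as ET

# The sample from the question, with the closing tag written as </div>
# so that it parses as well-formed XML.
SAMPLE = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
</ol>
</div>"""


def direct_text_nodes(element):
    """Return the text nodes that are direct children of element (like element/text())."""
    pieces = [element.text] + [child.tail for child in element]
    return [p for p in pieces if p]


root = ET.fromstring(SAMPLE)      # root is the div itself
items = root.findall("./ol/li")   # one Element per li
direct = [direct_text_nodes(li) for li in items]
print(direct)
# [['Central ', '.'], ['Culinary ', ' of ', '.']]
```

The text inside the anchors ('Intelligence Agency', 'Institute', 'America') is missing because those text nodes are children of a, not of li.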
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]
but that returns nothing at all!
Any suggestions? I'm using Python, so I'm open to using other modules for post-processing.
(I am using the Scrapy HtmlXPathSelector which seems XPath 1.0 compliant)
You were almost there. There is a small problem in:
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]
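Without parentheses, text is a name test that matches child elements named text, and there are none; the node test for text nodes is text(). The corrected expression is: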
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text()]
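However, there is a simpler expression that produces exactly the wanted concatenation of all text nodes under a single li (applied to a node-set, XPath 1.0 string() uses only the first node in document order, so evaluate it once per li to get every item):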
string(//div[contains(@id,"mw-content-text")]/ol/li)
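A minimal sketch of getting one string per li, using Python's standard-library xml.etree.ElementTree instead of Scrapy (my assumption, to keep the example dependency-free; it needs the sample to parse as well-formed XML). ElementTree's XPath subset has no string() function, so "".join(li.itertext()) stands in for it: itertext() walks the descendant text nodes in document order, which is what the string value of an element is built from. With lxml or a current Scrapy selector, the equivalent should be evaluating string(.) relative to each li. SAMPLE and li_strings are illustrative names.

```python
import xml.etree.ElementTree as ET

# The sample from the question, with the closing tag written as </div>
# so that it parses as well-formed XML.
SAMPLE = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
</ol>
</div>"""

root = ET.fromstring(SAMPLE)  # root is the div itself

# Compute the string value of each li separately, so items stay distinct.
li_strings = ["".join(li.itertext()) for li in root.findall("./ol/li")]

for text in li_strings:
    print(text)
# Central Intelligence Agency.
# Culinary Institute of America.
```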
I think the following would return the correct result:
//div[contains(@id,"mw-content-text")]/ol/li//text()
Note the double slash before text(). It means that text nodes at any level below li are returned, not just the direct children.
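One caveat: evaluated from the document, li//text() returns a single flat list of text nodes for all li elements, so the boundaries between items are lost. The sketch below imitates this with Python's standard-library xml.etree.ElementTree (an assumption made to avoid extra dependencies; ElementTree has no text() node test, so itertext() is used to enumerate descendant text nodes) and shows that collecting the text nodes relative to each li keeps the items separate. The names SAMPLE, flat and grouped are illustrative.

```python
import xml.etree.ElementTree as ET

# The sample from the question, with the closing tag written as </div>
# so that it parses as well-formed XML.
SAMPLE = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
</ol>
</div>"""

root = ET.fromstring(SAMPLE)     # root is the div itself
items = root.findall("./ol/li")

# Roughly what .../li//text() gives: every descendant text node in one list.
flat = [t for li in items for t in li.itertext()]

# Collecting per li instead keeps one group of text nodes per list item.
grouped = [list(li.itertext()) for li in items]

print(flat)
print(grouped)
```

Joining each inner list of grouped with "".join(...) then yields one flattened string per li.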