
HTML XPath: Extracting text mixed in with multiple tags?

Tags: html, xpath, scrapy

Goal: Extract the text of a particular element (e.g. li) while ignoring any tags mixed into it, i.e. flatten each first-level child and return its concatenated text as a separate string.

Example:

<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
    <ol>
    <li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
    <li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
    </ol>

</div>

desired text:

  • Central Intelligence Agency
  • Culinary Institute of America

However, the anchor tags wrapped around parts of the text prevent a simple retrieval.

To return each li tag separately, we use the straightforward:

//div[contains(@id,"mw-content-text")]/ol/li

but that returns the li elements with their nested anchor tags still in place. And

//div[contains(@id,"mw-content-text")]/ol/li/text()

returns only the text nodes that are direct children of li, e.g. 'Central ' and '.', and skips the text inside the anchors.

It seemed logical, then, to look for text nodes of self and descendants:

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]

but that returns nothing at all!

Any suggestions? I'm using Python, so I'm open to using other modules for post-processing.

(I am using the Scrapy HtmlXPathSelector which seems XPath 1.0 compliant)

ChaimKut asked May 16 '12 11:05


2 Answers

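You were almost there. The small problem is that text without parentheses is a name test that matches child elements named "text", not a node test for text nodes, so this predicate never matches: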

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]

The corrected expression, which selects every li that has at least one text-node descendant, is:

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text()]

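However, there is a simpler expression that produces exactly the wanted concatenation of all text nodes under a li. In XPath 1.0, string() converts only the first node of a node-set, so the following returns the text of the first li only (evaluate string(.) with each li as the context node to get the others):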

string(//div[contains(@id,"mw-content-text")]/ol/li)
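
As an illustration of the per-li flattening described above, here is a minimal sketch that uses only Python's standard library rather than Scrapy. It rests on a few assumptions: the sample markup is wrapped in a hypothetical body element and parsed as XML, the id is matched exactly because ElementTree's XPath subset has no contains(), and "".join(li.itertext()) stands in for string(.) since ElementTree cannot evaluate string(). The function name flatten_items is made up for this example.

```python
import xml.etree.ElementTree as ET

# The question's sample markup, written as well-formed XML.
SAMPLE = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
</ol>
</div>"""


def flatten_items(markup):
    """Return the concatenated text of each li, one string per item."""
    # Wrap the fragment in a hypothetical <body> so the div is a descendant
    # of the root and can be matched by the path below.
    root = ET.fromstring("<body>" + markup + "</body>")
    # ElementTree supports only a subset of XPath: exact attribute match,
    # no contains(), no text() node test, no string() function.
    items = root.findall(".//div[@id='mw-content-text']/ol/li")
    # itertext() yields every text node under the element in document
    # order; joining them mirrors what string(.) would return per li.
    return ["".join(li.itertext()) for li in items]


print(flatten_items(SAMPLE))
```

The result is one string per li, each still carrying its trailing period. With a full XPath 1.0 engine the same idea would be to select the li elements first and then evaluate string(.) on each one.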

Dimitre Novatchev answered Oct 18 '22 04:10


I think the following would return the correct result:

//div[contains(@id,"mw-content-text")]/ol/li//text()

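Note the double slash before text(). It selects text nodes at any depth below li, not just direct children. The result is a list of separate text nodes, so the pieces belonging to each li still have to be joined afterwards.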
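
To show what a descendant text() step yields, here is a small standard-library sketch, not Scrapy code. It assumes a single li parsed as XML and uses itertext() as a stand-in for li//text(); the helper name text_nodes is hypothetical.

```python
import xml.etree.ElementTree as ET

# One list item from the question's sample markup.
ITEM = (
    '<li>Culinary <a href="/Institute.html">Institute</a>'
    ' of <a href="/America.html">America</a>.</li>'
)


def text_nodes(element):
    """List every text node under element, in document order."""
    return list(element.itertext())


li = ET.fromstring(ITEM)
pieces = text_nodes(li)
# Each text node is a separate string, so a join is still required
# to rebuild the item's full text.
joined = "".join(pieces)

print(pieces)
print(joined)
```

When the absolute expression is run over several li elements at once, all of their text nodes come back in one flat list, so the per-item grouping is lost. Running the relative query on each li separately keeps the items apart.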

iddo answered Oct 18 '22 03:10