
Extracting text in between nodes through XPath

Tags: xpath

I'm trying to read specific parts of a webpage through XPath. The page is not very well-formed but I can't change that...

<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>

I want to extract the text of the various items, i.e. the text in between the header divs (e.g. 'Here is the text of the first item.'). I've used this XPath expression so far:

//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]

However, I cannot hardcode the ending item name, because the order of the items differs between the pages I want to scrape (e.g. 'First item' may be followed by 'Third item').

Any help on how to adapt my XPath query would be greatly appreciated.

Michiel Meulendijk asked Apr 16 '12 22:04


2 Answers

For the sake of completeness, the final query, composed of various suggestions throughout the thread:

//*[
    @class='textfield' and position() = 1
]
//text() [
    preceding::*[
        @class='header' and contains(text(),'First item')
    ]
][
    following::*[
        preceding::*[
            @class='header'
        ][1][
            contains(text(),'First item')
        ]
    ]
]
Michiel Meulendijk answered Nov 28 '22 15:11


//*[@class='header' and contains(text(),'First item')]/following::text()[1]

will select the first text node after <div class="header">First item</div>.

//*[@class='header' and contains(text(),'Second item')]/following::text()[1]

will select the first text node after <div class="header">Second item</div>, and so on.

EDIT: Sorry, this will not work for the <strong> cases. I will update my answer.

EDIT2: I used part of @Michiel's query. It looks like omg, but it works:

//div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]

It seems there should be a better solution for this :)
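One possible alternative to a single large expression is to do the grouping outside XPath. The sketch below walks the children of the first textfield div in Python and collects the text under each header, so no ending item name has to be hardcoded. It uses the standard library's xml.etree.ElementTree, whose XPath subset does not include the preceding:: or following:: axes, which is why this is a plain tree walk and not one of the queries above. The PAGE string and the texts_by_header helper are hypothetical names for illustration, and the sketch assumes the markup parses as XML; a real, badly formed page would probably need a lenient HTML parser first.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question (assumed to parse as XML).
PAGE = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""


def texts_by_header(container):
    """Map each header's text to the text between it and the next header.

    Hypothetical helper: it only looks at direct children of `container`.
    """
    sections = {}
    current = None
    for child in container:
        if child.get("class") == "header":
            # Start a new section; the header's tail is the text right after it.
            current = (child.text or "").strip()
            sections[current] = [child.tail or ""]
        elif current is not None:
            # Inline element such as <strong> or <span>: keep its own text,
            # then the text that follows it.
            sections[current].append("".join(child.itertext()))
            sections[current].append(child.tail or "")
    # Join the pieces and collapse runs of whitespace.
    return {name: " ".join("".join(parts).split())
            for name, parts in sections.items()}


root = ET.fromstring(PAGE)
first_textfield = root.find("div[@class='textfield']")  # first match only
items = texts_by_header(first_textfield)

print(items["First item"])  # Here is the text of the first item.
```

Because the sections are keyed by header text, the order in which the items appear on a given page does not matter; items["Third item"] is found the same way whether it follows 'First item' or 'Second item'.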

Aleh Douhi answered Nov 28 '22 15:11