
XPath to extract text after br tags in R

Tags: r, xpath

How can I extract the text after the br tags in the following lines?

<div id='population'>
    The Snow Leopard Survival Strategy (McCarthy <em>et al.</em> 2003, Table
    II) compiled national snow leopard population estimates, updating the work
    of Fox (1994). Many of the estimates are acknowledged to be rough and out
    of date, but the total estimated population is 4,080-6,590, as follows:<br>
    <br>
    Afghanistan: 100-200?<br>
    Bhutan: 100-200?<br>
    China: 2,000-2,500<br>
    India: 200-600<br>
    Kazakhstan: 180-200<br>
    Kyrgyzstan: 150-500<br>
    Mongolia: 500-1,000<br>
    Nepal: 300-500<br>
    Pakistan: 200-420<br>
    Russia: 150-200<br>
    Tajikistan: 180-220<br>
    Uzbekistan: 20-50
</div>

I got as far as:

xpathSApply(h, '//div[@id="population"]', xmlValue)

but I'm stuck now...

asked Jun 28 '12 20:06 by Kay


1 Answer

It helps if you realize that text is a node too. All text in the div that follows a <br/> can be retrieved with:

//div[@id="population"]/text()[preceding-sibling::br]

Technically, between <br/> tags would mean:

//div[@id="population"]/text()[preceding-sibling::br and following-sibling::br]

... but that would skip the last entry (Uzbekistan), which has no <br/> after it, so I guess that's not what you want at this point.
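For completeness, here is a minimal sketch of how both expressions could be run from R. It assumes the XML package (the one that provides xpathSApply, as in the question) and uses a shortened copy of the div as an inline string; the variable names html, after_br, between_br and the clean() helper are made up for illustration.

```r
library(XML)

# Shortened copy of the div from the question (assumption: the real page
# has the same structure, just with more country lines).
html <- paste0(
  "<div id='population'>total estimated population is 4,080-6,590, as follows:<br>",
  "<br>",
  "Afghanistan: 100-200?<br>",
  "Bhutan: 100-200?<br>",
  "Uzbekistan: 20-50",
  "</div>"
)

h <- htmlParse(html, asText = TRUE)

# Hypothetical helper: trim whitespace and drop empty text nodes, which can
# appear when the source has line breaks between consecutive <br> tags.
clean <- function(x) {
  x <- trimws(as.character(x))
  x[nzchar(x)]
}

# Text nodes that have at least one <br> before them inside the div.
after_br <- clean(xpathSApply(
  h, '//div[@id="population"]/text()[preceding-sibling::br]', xmlValue
))

# Text nodes that have a <br> both before and after them.
between_br <- clean(xpathSApply(
  h,
  '//div[@id="population"]/text()[preceding-sibling::br and following-sibling::br]',
  xmlValue
))

print(after_br)
print(between_br)
```

On this input, after_br should hold all three country lines, while between_br leaves out "Uzbekistan: 20-50" because no <br> follows it. The introductory sentence is excluded from both, since no <br> precedes it.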

answered Oct 06 '22 06:10 by Wrikken