
Extracting data using xpathSApply, conditioning on more than one attribute

The following URL has figures and tables, and I would like to read the first two columns of one table. The xpathSApply command works, but I need to condition on more than one attribute and cannot figure out how.

library(XML)

url = "http://floodobservatory.colorado.edu/SiteDisplays/1544data.htm"

doc = htmlTreeParse(url, useInternalNodes = TRUE)

A sample of the parsed data:

<tr height="20" style="height:15.0pt">
<td height="20" class="xl6521398" align="right" style="height:15.0pt">11-Oct-13</td>
  <td class="xl7321398">1853</td>
  <td class="xl7321398"></td>
  <td class="xl8121398">0.80</td>
  <td class="xl7221398" align="right">4.87</td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl1521398"></td>
  <td class="xl7421398"></td>
  <td class="xl7421398"></td>
  <td class="xl7421398"></td>
  <td class="xl7421398"></td>
  <td class="xl9621398"></td>
  <td class="xl7421398"></td>
  <td class="xl8121398"></td>
 </tr>

I need to read values from two cells: one holds the date and the other holds the streamflow discharge. They have the following attributes:

<td height="20" class="xl6521398" ...> and <td class="xl7321398" ...>

In the sample above, I need to grab "11-Oct-13" and "1853".

I used the following commands to get 'dates' and 'streamflow discharge'.

dates=xpathSApply(doc,"//td[@class='xl6521398']",xmlValue)

streamflowdischarge=xpathSApply(doc,"//td[@class='xl7321398']",xmlValue)

Both commands extract values, but the results include values from other tables/cells and, importantly, 'dates' and 'streamflow discharge' do not correspond.

dates[1:10]
[1] "1-Jan-98"  "2-Jan-98"  "3-Jan-98"  "31-Mar-98" "4-Jan-98"  "30-Apr-98" "5-Jan-98"
[8] "31-May-98" "6-Jan-98"  "30-Jun-98"

"31-Mar-98" appears between "3-Jan-98" and "4-Jan-98", which is unintended.

streamflowdischarge[1:10]
[1] "3108" "3076" "3051" "3111" "3064" "3043" "3007" "3066" "378"  ""

"3108" does not correspond to "1-Jan-98"; this can be checked at the URL.

It looks like there are other tables/cells with the same attributes, which I do not want to read. So I think I need to match on the entire set of attributes, i.e.,

<td height="20" class="xl6521398" align="right" style="height:15.0pt">

to get the 'date', and somehow add a condition so that the 'streamflow discharge' is extracted from the same table.

I would greatly appreciate suggestions, including any other options that are available.

I tried readHTMLTable, but got the error "subscript out of bounds".

Thanks, Satish

SatishR asked Nov 19 '14 22:11


2 Answers

I input the data

url = "http://floodobservatory.colorado.edu/SiteDisplays/1544data.htm"
html = htmlParse(url)

then queried for the table rows containing both of the cell classes you are interested in, taking the first or the second cell of each

query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]/td[1]"
dates = xpathSApply(html, query, xmlValue)
query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]/td[2]"
flows = xpathSApply(html, query, xmlValue)

I think these are what you want

> df = data.frame(dates=as.Date(dates, "%e-%b-%y"), flows=as.integer(flows))
> nrow(df)
[1] 5808
> head(df, 3)
     dates flows
1 1-Jan-98  1258
2 2-Jan-98  1584
3 3-Jan-98  1272
> tail(df, 3)
         dates flows
5806 23-Nov-13  2878
5807 24-Nov-13  2852
5808 25-Nov-13  2738
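
The effect of the row predicate can be checked on a small inline document. The following is only a sketch: the HTML, the short class names 'd', 'f' and 'x', and the variable names are made up for illustration and are not taken from the real page. It assumes, as in the question, that a second table reuses the same classes but never has both in one row.

```r
library(XML)

# Hypothetical toy page: two tables reuse the same cell classes,
# but only rows of the first table contain both classes.
toy <- '<html><body>
<table>
  <tr><td class="d">1-Jan-98</td><td class="f">1258</td></tr>
  <tr><td class="d">2-Jan-98</td><td class="f">1584</td></tr>
</table>
<table>
  <tr><td class="d">31-Mar-98</td><td class="x">summary</td></tr>
  <tr><td class="x">label</td><td class="f">3108</td></tr>
</table>
</body></html>'
doc_toy <- htmlParse(toy, asText = TRUE)

# Matching on the class alone picks up cells from both tables.
all_dates <- xpathSApply(doc_toy, "//td[@class='d']", xmlValue)

# Requiring both classes in the same row keeps only the paired cells.
row_query <- "//tr[./td[@class='d'] and ./td[@class='f']]"
dates_toy <- xpathSApply(doc_toy, paste0(row_query, "/td[1]"), xmlValue)
flows_toy <- xpathSApply(doc_toy, paste0(row_query, "/td[2]"), xmlValue)
```

Here all_dates also contains the stray "31-Mar-98", while dates_toy and flows_toy have the same length and line up row by row.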

I guess the secret was to group the data by selecting rows that contain both columns of interest (though these classes may simply be generated by the spreadsheet used to make the web page, and have nothing to do with the semantic meaning of the data). A more 'complete' scraping might create a node set of the rows, and then query each row for the (sometimes several) columns labelled with the class of interest, e.g.,

query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]"
nodes = getNodeSet(html, query)
date = lapply(nodes, xpathSApply, "./td[@class='xl6521398']", xmlValue)
flow = lapply(nodes, xpathSApply, "./td[@class='xl7321398']", xmlValue)

The date and flow elements are coordinated, but there can be several flow measurements per date.

> head(flow, 3)
[[1]]
[1] "1258" ""     "1799" "2621" "1258"

[[2]]
[1] "1584" ""     "1550" "2033" "978" 

[[3]]
[1] "1272" ""     "1104" "3515" "233" 

> table(sapply(flow, length))

   2    3    5 
5577   15  216 
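
If only one flow value per date is wanted, the per-row lists can be reduced before building a data frame. The sketch below uses made-up toy HTML and class names, and it assumes that the first cell of the flow class in each row is the measurement of interest; that assumption should be checked against the real page.

```r
library(XML)

# Hypothetical toy table: each row has one date cell and several flow cells.
toy <- '<table>
  <tr><td class="d">1-Jan-98</td><td class="f">1258</td><td class="f"></td><td class="f">1799</td></tr>
  <tr><td class="d">2-Jan-98</td><td class="f">1584</td><td class="f"></td></tr>
</table>'
doc_toy <- htmlParse(toy, asText = TRUE)

# One node per row that has both classes.
rows <- getNodeSet(doc_toy, "//tr[./td[@class='d'] and ./td[@class='f']]")

# Query each row separately and keep the first match (assumption: the
# first flow cell is the one wanted).
date_toy <- sapply(rows, function(r) xpathSApply(r, "./td[@class='d']", xmlValue)[1])
flow_toy <- sapply(rows, function(r) xpathSApply(r, "./td[@class='f']", xmlValue)[1])

# Parsing "%b" relies on English month names in the current locale.
df_toy <- data.frame(date = as.Date(date_toy, "%d-%b-%y"),
                     flow = as.integer(flow_toy))
```

Because every row contributes exactly one date and one flow, the two columns stay aligned even when rows differ in how many flow cells they hold.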

So I guess this is for the Blue Nile, in the Sudan; neat

url = "http://floodobservatory.colorado.edu/SiteDisplays/Summary5.htm"
sites = htmlParse(url)

> sites["//tr[./td[1] = '1544']"]
[[1]]
<tr height="17" style="height:12.75pt"><td height="17" class="xl7226158" style="height:12.75pt">1544</td>&#13;
  <td class="xl6926158"/>&#13;
  <td class="xl7026158">13.0940</td>&#13;
  <td class="xl7026158">33.9750</td>&#13;
  <td class="xl6926158">5070</td>&#13;
  <td class="xl6926158">Blue Nile</td>&#13;
  <td class="xl6926158">Sudan</td>&#13;
  <td class="xl6926158">2</td>&#13;
  <td class="xl6926158">2</td>&#13;
  <td class="xl7926158">173%</td>&#13;
  <td class="xl8226158">15.88</td>&#13;
  <td class="xl7126158">19-Nov-14</td>&#13;
  <td class="xl7126158"/>&#13;
 </tr> 

attr(,"class")
[1] "XMLNodeSet"
Martin Morgan answered Sep 24 '22 03:09


You can use the 'and' and '|' operators within an XPath expression:

path_xp <-  '//td[@class="xl6521398" and  @height="20"]|//td[@class="xl7321398"]'

res <- xpathSApply(doc,path_xp,xmlValue)
[1] "11-Oct-13" "1853"      "" 

Note that you get 3 elements here because there are 2 elements with the class attribute equal to xl7321398. You may need to make your query more precise, or you can simply remove the empty third element.

res[nzchar(res)]
[1] "11-Oct-13" "1853" 
agstudy answered Sep 23 '22 03:09