The following URL has figures and tables, and I would like to read the first two columns of one table. The xpathSApply command works fine, but I need to condition on multiple attributes and cannot figure out how.
library(XML)
url = "http://floodobservatory.colorado.edu/SiteDisplays/1544data.htm"
doc = htmlTreeParse(url, useInternalNodes = TRUE)
A sample of the parsed data:
<tr height="20" style="height:15.0pt">
<td height="20" class="xl6521398" align="right" style="height:15.0pt">11-Oct-13</td>
<td class="xl7321398">1853</td>
<td class="xl7321398"></td>
<td class="xl8121398">0.80</td>
<td class="xl7221398" align="right">4.87</td>
<td class="xl1521398"></td>
<td class="xl1521398"></td>
<td class="xl1521398"></td>
<td class="xl1521398"></td>
<td class="xl1521398"></td>
<td class="xl1521398"></td>
<td class="xl7421398"></td>
<td class="xl7421398"></td>
<td class="xl7421398"></td>
<td class="xl7421398"></td>
<td class="xl9621398"></td>
<td class="xl7421398"></td>
<td class="xl8121398"></td>
</tr>
I need to read values from two cells: one holds the date and the other the streamflow discharge. They have the following attributes:
<td height="20" class="xl6521398" ...> and <td class="xl7321398" ...>
For the sample data above, I need to grab "11-Oct-13" and "1853".
I used the following commands to get 'dates' and 'streamflow discharge':
dates=xpathSApply(doc,"//td[@class='xl6521398']",xmlValue)
streamflowdischarge=xpathSApply(doc,"//td[@class='xl7321398']",xmlValue)
They run without error, but the extracted values include values from other tables/cells and, importantly, 'dates' and 'streamflow discharge' do not correspond.
> dates[1:10]
[1] "1-Jan-98" "2-Jan-98" "3-Jan-98" "31-Mar-98" "4-Jan-98" "30-Apr-98" "5-Jan-98"
[8] "31-May-98" "6-Jan-98" "30-Jun-98"
"31-Mar-98" appears between "3-Jan-98" and "4-Jan-98", which is not intended.
> streamflowdischarge[1:10]
[1] "3108" "3076" "3051" "3111" "3064" "3043" "3007" "3066" "378" ""
"3108" does not correspond to "1-Jan-98"; this can be checked at the URL.
It looks like there are other tables/cells with the same attributes, which I do not want to read. I therefore think I need to match on the entire set of attributes, i.e.,
<td height="20" class="xl6521398" align="right" style="height:15.0pt">
to get the 'date', and then somehow add a condition so that the 'streamflow discharge' is extracted from the same table.
I would greatly appreciate suggestions, including any other options available.
I tried readHTMLTable, but got the error "subscript out of bounds".
Thanks, Satish
I parsed the page
library(XML)
url = "http://floodobservatory.colorado.edu/SiteDisplays/1544data.htm"
html = htmlParse(url)
then queried for the table rows containing both of the cell classes you are interested in, taking the first or the second cell of each row
query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]/td[1]"
dates = xpathSApply(html, query, xmlValue)
query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]/td[2]"
flows = xpathSApply(html, query, xmlValue)
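To see why the row predicate keeps the two columns aligned, here is a minimal sketch on an inline HTML string rather than the live page. The snippet, and the names mini, rows, mini_dates and mini_flows, are made up for illustration; the second row imitates an unrelated cell that reuses the date class.

```r
library(XML)

# Hypothetical miniature of the page: one data row, plus one unrelated row
# that reuses the date class but has no discharge cell.
snippet <- '
<table>
  <tr><td class="xl6521398">11-Oct-13</td><td class="xl7321398">1853</td><td class="xl7321398"></td></tr>
  <tr><td class="xl6521398">31-Mar-98</td><td class="xl1521398">summary</td></tr>
</table>'
mini <- htmlParse(snippet, asText = TRUE)

# Keep only rows that contain cells of both classes, then pick cells by position.
rows <- "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]"
mini_dates <- xpathSApply(mini, paste0(rows, "/td[1]"), xmlValue)
mini_flows <- xpathSApply(mini, paste0(rows, "/td[2]"), xmlValue)
```

Because both queries share the same row filter, the stray "31-Mar-98" row is dropped from both results and the vectors stay the same length.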
I think these are what you want
> df = data.frame(dates=as.Date(dates, "%e-%b-%y"), flows=as.integer(flows))
> nrow(df)
[1] 5808
> head(df, 3)
dates flows
1 1-Jan-98 1258
2 2-Jan-98 1584
3 3-Jan-98 1272
> tail(df, 3)
dates flows
5806 23-Nov-13 2878
5807 24-Nov-13 2852
5808 25-Nov-13 2738
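As a side note on the date conversion, the sketch below parses one of the date strings on its own. It assumes English month abbreviations; since %b is locale-dependent, the code switches LC_TIME to the "C" locale first. The name parsed is only for illustration.

```r
# Month abbreviations such as "Oct" are matched according to the LC_TIME
# locale, so use the "C" locale to get English names (an assumption that
# the page always uses English abbreviations).
invisible(Sys.setlocale("LC_TIME", "C"))

# Day of month, abbreviated month name, two-digit year.
parsed <- as.Date("11-Oct-13", format = "%d-%b-%y")
class(parsed)
```

If the result is NA, the month names probably did not match the active locale.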
I guess the secret was to group the data by selecting the rows that contain both columns of interest. (That said, these classes may simply be generated by the spreadsheet used to make the web page and have nothing to do with the semantic meaning of the data.) A more 'complete' scraping might create a node set of the rows, and then query each row for the (sometimes several) columns labelled with the class of interest, e.g.,
query = "//tr[./td[@class='xl6521398'] and ./td[@class='xl7321398']]"
nodes = getNodeSet(html, query)
date = lapply(nodes, xpathSApply, "./td[@class='xl6521398']", xmlValue)
flow = lapply(nodes, xpathSApply, "./td[@class='xl7321398']", xmlValue)
The date and flow elements are coordinated, but there can be several flow measurements per date.
> head(flow, 3)
[[1]]
[1] "1258" "" "1799" "2621" "1258"
[[2]]
[1] "1584" "" "1550" "2033" "978"
[[3]]
[1] "1272" "" "1104" "3515" "233"
> table(sapply(flow, length))
2 3 5
5577 15 216
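If only the first flow value per row is wanted, the coordinated lists can be collapsed into a data frame. This is a sketch on hypothetical stand-ins (date_list, flow_list) for the date and flow lists, and it assumes the first class-matched cell in each row is the discharge of interest.

```r
# Hypothetical stand-ins for the per-row `date` and `flow` lists.
date_list <- list("1-Jan-98", "2-Jan-98")
flow_list <- list(c("1258", "", "1799"), c("1584", ""))

# Take the first matching cell of each row; vapply enforces one string per row.
first_flow <- vapply(flow_list, function(x) x[1], character(1))

paired <- data.frame(
  date = unlist(date_list),
  flow = as.integer(first_flow),
  stringsAsFactors = FALSE
)
```

Rows whose first cell is empty would come out as NA in the flow column, which makes them easy to spot and filter.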
So I guess this is for the Blue Nile, in the Sudan; neat
url = "http://floodobservatory.colorado.edu/SiteDisplays/Summary5.htm"
sites = htmlParse(url)
> sites["//tr[./td[1] = '1544']"]
[[1]]
<tr height="17" style="height:12.75pt"><td height="17" class="xl7226158" style="height:12.75pt">1544</td>
<td class="xl6926158"/>
<td class="xl7026158">13.0940</td>
<td class="xl7026158">33.9750</td>
<td class="xl6926158">5070</td>
<td class="xl6926158">Blue Nile</td>
<td class="xl6926158">Sudan</td>
<td class="xl6926158">2</td>
<td class="xl6926158">2</td>
<td class="xl7926158">173%</td>
<td class="xl8226158">15.88</td>
<td class="xl7126158">19-Nov-14</td>
<td class="xl7126158"/>
</tr>
attr(,"class")
[1] "XMLNodeSet"
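To get the cell values of the matching row as a character vector rather than a node set, the same predicate can be extended with /td. A minimal sketch on an inline string follows; the second row and the names site_html, site_doc and site_cells are hypothetical.

```r
library(XML)

# Cut-down, partly hypothetical version of the summary table.
site_html <- '<table>
<tr><td class="xl7226158">1544</td><td class="xl6926158">Blue Nile</td><td class="xl6926158">Sudan</td></tr>
<tr><td class="xl7226158">9999</td><td class="xl6926158">Hypothetical River</td><td class="xl6926158">Nowhere</td></tr>
</table>'
site_doc <- htmlParse(site_html, asText = TRUE)

# Select the row whose first cell is the site number, then all of its cells.
site_cells <- xpathSApply(site_doc, "//tr[./td[1] = '1544']/td", xmlValue)
```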
You can use the "and" and "|" operators within XPath:
path_xp <- '//td[@class="xl6521398" and @height="20"]|//td[@class="xl7321398"]'
res <- xpathSApply(doc,path_xp,xmlValue)
[1] "11-Oct-13" "1853" ""
Note that you get 3 elements here because there are 2 elements with the class attribute equal to xl7321398. You could make the query more specific, or simply remove the empty third element:
res[nzchar(res)]
[1] "11-Oct-13" "1853"
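One way to make the query more specific, sketched here on the sample row only, is to anchor on the date cell and step to its next sibling with the following-sibling axis. This assumes the discharge cell always comes directly after the date cell; the names row_html, row_doc, date_xp, date_val and flow_val are made up for the example.

```r
library(XML)

# The sample row from the question, trimmed to its first three cells.
row_html <- '<table><tr height="20">
<td height="20" class="xl6521398" align="right">11-Oct-13</td>
<td class="xl7321398">1853</td>
<td class="xl7321398"></td>
</tr></table>'
row_doc <- htmlParse(row_html, asText = TRUE)

# Match the date cell on two attributes at once.
date_xp <- "//td[@class='xl6521398' and @height='20']"
date_val <- xpathSApply(row_doc, date_xp, xmlValue)

# The first td after the date cell (assumed to hold the discharge).
flow_val <- xpathSApply(row_doc, paste0(date_xp, "/following-sibling::td[1]"), xmlValue)
```

This avoids picking up the empty second xl7321398 cell, so no filtering with nzchar is needed afterwards.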