The following script gets me to a web page with several links that have similar names. I want only one of them, which can be differentiated from the others because it is printed in bold on the page. However, I could not find a way to select a bold link within a list.
Would anyone have a hint on this? Thanks in advance!
library(httr)
library(rvest)
sp="Alnus japonica"
res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do",
body = list(page ="advancedSearch",
AttachmentExist ="",
family ="",
placeOfPub ="",
genus = unlist(strsplit(as.character(sp), split=" "))[1],
yearPublished ="",
species = unlist(strsplit(as.character(sp), split=" "))[2],
author ="",
infraRank ="",
infraEpithet ="",
selectedLevel ="cont"),
encode ="form")
pg <- content(res, as="parsed")
lnks <- html_attr(html_nodes(pg,"a"),"href")
# How do I get the URL of the link with the accepted name (in bold)?
res2 <- try(GET(sprintf("http://apps.kew.org%s", lnks[grep("id=", lnks)][1])), silent = TRUE)
# This gets a link, but often not the bold one
First, grab tidy-html5 (it works on pretty much everything), install it, and ensure it's in your PATH.
As my comment said, browsers handle <b> outside <p> because they need to be bulletproof; libxml2 does not. So we need to clean this up first (and I now need to make a new tidyhtml package) and then process the tidied version:
library(xml2)
library(httr)
library(rvest)
res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do",
body = list(page ="advancedSearch",
AttachmentExist ="",
family ="",
placeOfPub ="",
genus = "Alnus",
yearPublished ="",
species = "japonica",
author ="",
infraRank ="",
infraEpithet ="",
selectedLevel ="cont"),
encode ="form")
tf <- tempfile(fileext=".html")
cat(content(res, as="text"), file=tf)
tidy <- system2("tidy", c("-q", tf), TRUE)
pg <- read_html(paste(tidy, collapse="\n"))
html_nodes(pg, xpath=".//p/b/a[contains(@href, 'name_id')]")
## {xml_nodeset (1)}
## [1] <a href="/wcsp/namedetail.do?name_id=6471" class="onwa ...
If CSS selectors are desired over XPath:
html_nodes(pg, "p > b > a[href*='name_id']")
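To get from the matched node to the URL the question was after, the href can be pulled out and prefixed with the host. This is only a minimal sketch: the inline HTML fragment below is a hypothetical stand-in for the tidied page (the first name_id is made up), and the variable names are illustrative.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the tidied search results: after tidying, the
# accepted name's link sits inside <p><b>...</b></p>; other links have no <b>.
tidied <- '
<html><body>
<p><a href="/wcsp/namedetail.do?name_id=1111" class="onwardnav"><i>Alnus</i> <i>japonica</i> placeholder</a></p>
<p><b><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav"><i>Alnus</i> <i>japonica</i></a></b></p>
</body></html>'

pg <- read_html(tidied)

# Select only the link wrapped in <b>, then read its href attribute
bold_link <- html_nodes(pg, "p > b > a[href*='name_id']")
href <- html_attr(bold_link, "href")

# Build the absolute URL, as the question's sprintf() call does
accepted_url <- paste0("http://apps.kew.org", href)
```

The resulting accepted_url could then be passed to httr::GET() in place of the lnks[grep("id=", lnks)][1] lookup from the question.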
UPDATE
I started a basic pkg wrapper for libtidy. If you're on OS X and use Homebrew you can do: brew install tidy-html5 (which installs the binary above and the libtidy library) and devtools::install_github("hrbrmstr/tidyhtml") to install the pkg. Then, it's just:
library(xml2)
library(httr)
library(rvest)
library(htmltidy)
res <- httr::POST(url ="http://apps.kew.org/wcsp/advsearch.do",
body = list(page ="advancedSearch",
AttachmentExist ="",
family ="",
placeOfPub ="",
genus = "Alnus",
yearPublished ="",
species = "japonica",
author ="",
infraRank ="",
infraEpithet ="",
selectedLevel ="cont"),
encode ="form")
tidy_html <- tidy(content(res, as="text"))
pg <- read_html(tidy_html)
html_nodes(pg, "p > b > a[href*='name_id']")
I should be able to get this to work on Windows & Linux and make it a real package (it's a thin wrapper w/o error checking now), but that'll be down on the TODO for a while.
Seems to me like there might be a bug with rvest/httr here, as <b> appears to surround <a href...> on the relevant link, but not in the parsed version.
I used:
library(rvest)
sp=strsplit("Alnus japonica", " ")[[1]]
session <- html_session("http://apps.kew.org/wcsp/advsearch.do")
form <- html_form(session)[[1]]
filled_form <- set_values(form, genus = sp[1], species = sp[2])
out <- submit_form(session, filled_form)
Look at the following:
out %>% html_nodes(xpath = "descendant-or-self::*") %>% `[`(81:90)
# {xml_nodeset (10)}
# [1] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
# [2] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
# [3] <i>Alnus</i>
# [4] <i> japonica</i>
# [5] <b>\n </b>
# [6] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
# [7] <a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A?nam ...
# [8] <i>Alnus</i>
# [9] <i> japonica</i>
# [10] <p><a href="/wcsp/namedetail.do;jsessionid=F6180417706056852E58C1E290B5087A? ...
As you can see, the <b> node appears empty. However, when I enter the search manually and View Source in Chrome, I see:
<b>
<p><a href="/wcsp/namedetail.do?name_id=6471" class="onwardnav"><i>Alnus</i><i> japonica</i> (Thunb.) Steud., Nomencl. Bot., ed. 2, 1: 55 (1840).</a>
</p>
</b>
That the relevant <a> is between <b> and </b> tells me it should be a descendant of that <b>, but this comes up blank:
out %>% html_nodes(xpath = "//b/child::*")
I'm admittedly no XPath expert, so I could be mucking things up here. Hope this helps get you on your way.
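One possible workaround, going only by the nodeset printed above: since the parser leaves an empty <b> immediately before the <p> that holds the bold link, the link might be reachable as the first element sibling following <b>. The sketch below runs on a hypothetical fragment that imitates that parsed structure (the hrefs other than name_id=6471 are made up), and it is an assumption that the live page always parses this way.

```r
library(xml2)
library(rvest)

# Hypothetical fragment imitating the parsed structure shown above:
# an empty <b> followed by the <p> that was meant to be inside it.
parsed_like <- '
<html><body><div>
<p><a href="/wcsp/namedetail.do?name_id=1111">first</a></p>
<b>
</b>
<p><a href="/wcsp/namedetail.do?name_id=6471">second</a></p>
<p><a href="/wcsp/namedetail.do?name_id=2222">third</a></p>
</div></body></html>'

doc <- read_html(parsed_like)

# Take the first element after each <b>, keep it only if it is a <p>,
# then pick the name-detail link inside it
after_b <- html_nodes(
  doc,
  xpath = "//b/following-sibling::*[1][self::p]/a[contains(@href, 'name_id')]"
)
after_b_href <- html_attr(after_b, "href")
```

This avoids the external tidy step, but it depends on how the parser happens to repair the markup, so cleaning the HTML first is likely the more robust route.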