I am using the rvest package to scrape information from the page http://www.radiolab.org/series/podcasts. After scraping the first page, I want to follow the "Next" link at the bottom, scrape the second page, move on to the third page, and so on.
The following line gives an error:
html_session("http://www.radiolab.org/series/podcasts") %>% follow_link("Next")
## Navigating to
##
## ./2/
## Error in parseURI(u) : cannot parse URI
##
## ./2/
Inspecting the HTML shows that the href contains extra whitespace and newlines around the "./2/", which rvest apparently doesn't like:
html("http://www.radiolab.org/series/podcasts") %>% html_node(".pagefooter-next a")
## <a href=" ./2/ ">Next</a>
.Last.value %>% html_attrs()
## href
## "\n \n ./2/ "
Question 1:
How can I get rvest::follow_link to treat this link correctly, as my browser does? (I could manually grab the "Next" link and clean it up with a regex, but I would prefer to take advantage of the automation that rvest provides.)
At the end of the follow_link code, it calls jump_to. So I tried the following:
html_session("http://www.radiolab.org/series/podcasts") %>% jump_to("./2/")
## <session> http://www.radiolab.org/series/2/
## Status: 404
## Type: text/html; charset=utf-8
## Size: 10744
## Warning message:
## In request_GET(x, url, ...) : client error: (404) Not Found
Digging into the code, it looks like the issue is with XML::getRelativeURL, which uses dirname to strip off the last part of the original path ("/podcasts"):
XML::getRelativeURL("./2/", "http://www.radiolab.org/series/podcasts/")
## [1] "http://www.radiolab.org/series/./2"
XML::getRelativeURL("../3/", "http://www.radiolab.org/series/podcasts/2/")
## [1] "http://www.radiolab.org/series/3"
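The dirname behaviour can be checked in isolation with base R. This is only a sketch of that one step, assuming (as the reading of the code above suggests) that getRelativeURL relies on dirname; it does not reproduce the rest of what XML::getRelativeURL does. The variable names are made up for illustration.

```r
# Base R only: dirname() drops trailing slashes before removing the last
# path component, so both forms of the base URL reduce to the same parent.
with_slash    <- "http://www.radiolab.org/series/podcasts/"
without_slash <- "http://www.radiolab.org/series/podcasts"

parent_with    <- dirname(with_slash)
parent_without <- dirname(without_slash)

print(parent_with)
print(identical(parent_with, parent_without))
```

If that holds, the trailing slash on the base URL makes no difference, and "/podcasts" is dropped either way before the relative part is appended.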
Question 2:
How can I get rvest::jump_to and XML::getRelativeURL to correctly handle relative paths?
Since this problem still seems to occur on the RadioLab site, the most practical solution is to write a custom function that handles this edge case. If you only care about this site, and this particular error, you can write something like this:
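library(rvest)
# Follow the first link whose text contains `text`, cleaning up its href first.
follow_next <- function(session, text = "Next", ...) {
  # Select the first element that has a text node containing `text`
  link <- html_node(session, xpath = sprintf("//*[text()[contains(.,'%s')]]", text))
  url <- html_attr(link, "href")
  # Strip the stray whitespace and newlines surrounding the href
  url <- trimws(url)
  # Drop a leading "./", which jump_to() resolved against the wrong directory
  url <- gsub("^\\.{1}/", "", url)
  message("Navigating to ", url)
  jump_to(session, url, ...)
}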
That would allow you to write code like this:
html_session("http://www.radiolab.org/series/podcasts") %>%
  follow_next()
#> Navigating to 2/
#> <session> http://www.radiolab.org/series/podcasts/2/
#> Status: 200
#> Type: text/html; charset=utf-8
#> Size: 61261
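Since the live page may change, the cleanup step can also be checked offline against the raw href value reported in the question. The helper name clean_href is hypothetical (it is not part of rvest); it is a minimal sketch that applies the same two transformations as the function above, using base R only.

```r
# Hypothetical helper (not part of rvest): the same cleanup applied to the
# href before it is handed to jump_to().
clean_href <- function(href) {
  href <- trimws(href)       # remove surrounding spaces and newlines
  sub("^\\./", "", href)     # drop one leading "./"; "../" is left untouched
}

cleaned <- clean_href("\n \n ./2/ ")   # raw href from the question
print(cleaned)

parent_link <- clean_href("../3/")     # parent-relative link, as on later pages
print(parent_link)
```

Note that a link such as "../3/" passes through unchanged, so pages beyond the first may need additional handling.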
This is not per se an error in rvest: the URL in RadioLab's markup is malformed, and failing to parse a malformed URL is not a bug. If you want to be liberal in how you handle it, you need to work around it manually.
Note that you could also use RSelenium to launch an actual browser (e.g. Chrome) and have that perform the URL parsing for you.