I have a list of about 13,000 URLs that I want to extract info from; however, not every URL actually exists. In fact, the majority don't. I have just tried passing all 13,000 URLs through html(), but it takes a long time. I am trying to work out how to check whether the URLs actually exist before parsing them with html().
I have tried httr's GET() function, as well as RCurl's url.exists(). For some reason url.exists() always returns FALSE, even when the URL does exist, and the way I am using GET() always returns a success; I think this is because the page is being redirected.
The following URLs represent the type of pages I am parsing; the second does not exist:
urls <- data.frame('site' = 1:3,
                   'urls' = c('https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
                              'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
                              'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'))
urls$urls <- as.character(urls$urls)
For GET(), the problem is that the second URL doesn't actually exist, but it is redirected and therefore returns a "success".
urls$urlExists <- sapply(1:length(urls[,1]),
function(x) ifelse(http_status(GET(urls[x, 'urls']))[[1]] == "success", 1, 0))
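A quick way to see what is happening (a sketch, assuming httr is loaded and the urls data frame defined above): GET() follows redirects by default, so the final response carries a 200 status even when the requested unit guide does not exist.

resp <- GET(urls$urls[2])   # the second (non-existent, redirected) URL
status_code(resp)           # 200, because the redirect target was fetched successfully
resp$url                    # the final URL no longer matches the one requested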
For url.exists(), I get three FALSE values returned, even though the first and third URLs do exist.
urls$urlExists2 <- sapply(1:length(urls[,1]), function(x) url.exists(urls[x, 'urls']))
I checked these two posts 1, 2, but I would prefer not to use a user agent, simply because I am not sure how to find mine or whether it would change for different people running this code on other computers, which would make the code harder for others to pick up and use. Both posts' answers suggest using GET() from httr. It seems that GET() is probably the preferred method, but I would need to figure out how to deal with the redirection issue.
Can anyone suggest a good way in R to test whether a URL exists before parsing it with html()? I would also be happy with any other suggested workaround for this issue.
UPDATE:
After looking into the value returned by GET(), I figured out a workaround; see the answers for details.
With httr, use url_success() with redirect following turned off:
library(httr)
urls <- c(
'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)
sapply(urls, url_success, config(followlocation = 0L), USE.NAMES = FALSE)
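As a possible follow-up (a sketch building on the call above): keep only the URLs that pass the check, so that only live pages are ever passed to html().

ok <- sapply(urls, url_success, config(followlocation = 0L), USE.NAMES = FALSE)
live_urls <- urls[ok]   # parse only these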
url_success(x) is deprecated; please use !http_error(x) instead. So here is an updated version of Hadley's solution:
library(httr)

urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)

!sapply(urls, http_error)
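One caveat: http_error() only treats status codes of 400 and above as errors, whereas url_success() required a status below 300, so a 3xx redirect that the followlocation = 0L version would have flagged as a failure still passes !http_error(). A sketch that keeps that behaviour, assuming httr is loaded; url_live() is just an illustrative name:

url_live <- function(u) {
  # Issue the request without following redirects, then treat anything other
  # than a 2xx status (including the 3xx redirect) as "does not exist"
  resp <- GET(u, config(followlocation = 0L))
  status_code(resp) < 300
}

sapply(urls, url_live, USE.NAMES = FALSE)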
After a suggestion from @TimBiegeleisen I looked at what is returned by the function GET(). It turns out that if the URL exists, GET() returns that URL as a value, but if it is redirected, a different URL is returned. I just changed the code to check whether the URL returned by GET() matches the one I submitted.
urls$urlExists <- sapply(1:length(urls[,1]), function(x) ifelse(GET(urls[x, 'urls'])[[1]] == urls[x,'urls'], 1, 0))
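The same check written as a small helper (a sketch; url_unredirected() is just an illustrative name, and resp$url is the final URL httr reports after any redirects, i.e. the first element of the response object used above):

url_unredirected <- function(u) {
  # An existing unit guide is served from the URL that was requested; a missing
  # one is redirected, so the final URL no longer matches the request
  resp <- GET(u)
  identical(resp$url, u)
}

urls$urlExists <- as.integer(sapply(urls$urls, url_unredirected, USE.NAMES = FALSE))

With 13,000 URLs it may also be worth trying HEAD() in place of GET() for this check, so the page bodies are not downloaded twice, although some servers answer HEAD requests differently.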
I would be interested in learning about any better methods that people use for the same thing.