
R: Check existence of url, problems with httr:GET() and url.exists()

Tags: html, url, r, get, httr

I have a list of about 13,000 URLs that I want to extract information from; however, not every URL actually exists. In fact, the majority don't. I tried passing all 13,000 URLs through html(), but it takes a long time. I am trying to work out how to check whether each URL actually exists before passing it to html(). I have tried GET() from httr as well as url.exists() from RCurl. For some reason url.exists() always returns FALSE, even when the URL does exist, and the way I am using GET() always returns a success; I think this is because the page is being redirected.

The following URLs represent the type of pages I am parsing; the second does not exist:

urls <- data.frame('site' = 1:3, 'urls' = c('https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010', 
                            'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
                            'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'))

urls$urls <- as.character(urls$urls)

For GET(), the problem is that the second URL doesn't actually exist but it is redirected and therefore returns a "success".

 urls$urlExists <- sapply(1:length(urls[,1]), 
                     function(x) ifelse(http_status(GET(urls[x, 'urls']))[[1]] == "success", 1, 0))
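To see the redirect directly, it can help to look at more than http_status(). The sketch below assumes the httr response object exposes url and all_headers fields, with one all_headers entry per hop. The helper names redirect_statuses(), was_redirected() and inspect_url() are made up for illustration and are not part of httr.

```r
# Status code of every hop recorded in a response's `all_headers` list.
redirect_statuses <- function(all_headers) {
  vapply(all_headers, function(h) as.integer(h$status), integer(1))
}

# More than one hop means at least one redirect was followed.
was_redirected <- function(all_headers) {
  length(all_headers) > 1
}

# Hypothetical wrapper: needs the httr package and network access.
inspect_url <- function(u) {
  resp <- httr::GET(u)
  list(
    requested  = u,
    final      = resp$url,
    statuses   = redirect_statuses(resp$all_headers),
    redirected = was_redirected(resp$all_headers)
  )
}

# Example call (not run here):
# inspect_url(urls[2, "urls"])
```

If the final URL differs from the requested one, or the hop list has more than one entry, the "success" comes from the page the server redirected to rather than from the page that was asked for.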

For url.exists(), I get FALSE for all three, even though the first and third URLs do exist.

 urls$urlExists2 <- sapply(1:length(urls[,1]), function(x) url.exists(urls[x, 'urls']))

I checked these two posts (1, 2), but I would prefer not to set a user agent, simply because I am not sure how to find mine or whether it would differ for other people running this code on other computers, which would make the code harder for others to pick up and use. The answers to both posts suggest using GET() from httr. It seems that GET() is probably the preferred method, but I would need to figure out how to deal with the redirection issue.

Can anyone suggest a good way in R to test whether a URL exists before passing it to html()? I would also be happy with any other suggested workaround for this issue.

UPDATE:

After looking into the value returned by GET(), I figured out a workaround; see my answer for details.

Adam asked Jul 15 '15 01:07



3 Answers

With httr, use url_success() and redirect following turned off:

library(httr)

urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010', 
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)

sapply(urls, url_success, config(followlocation = 0L), USE.NAMES = FALSE)
hadley answered Nov 06 '22 01:11



url_success(x) is deprecated; please use !http_error(x) instead.

So hadley's solution can be updated as follows:

library(httr)

urls <- c(
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=HMM202',
  'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'
)

!sapply(urls, http_error)
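One thing to keep in mind: as far as I know, http_error() only reports status codes of 400 or above, so a 3xx redirect response is not counted as an error, and with redirects followed the check applies to the final page. A minimal sketch that turns redirect following off and counts only a plain 200 as "exists" might look like the following. The helper names status_means_exists() and url_exists_no_redirect() are hypothetical, and it assumes the server answers HEAD requests sensibly.

```r
# With redirects disabled, only a 200 counts as "exists";
# 3xx (redirect), 4xx/5xx, and failed requests (NA) count as missing.
status_means_exists <- function(status) {
  !is.na(status) & status == 200L
}

# Hypothetical wrapper: needs the httr package and network access.
url_exists_no_redirect <- function(u) {
  status <- tryCatch(
    httr::status_code(httr::HEAD(u, httr::config(followlocation = 0L))),
    error = function(e) NA_integer_  # e.g. DNS failure or timeout
  )
  status_means_exists(status)
}

# Example call (not run here):
# vapply(urls, url_exists_no_redirect, logical(1), USE.NAMES = FALSE)
```

Using vapply() with logical(1) rather than sapply() keeps the result a logical vector even if the input is empty.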
Shixiang Wang answered Nov 06 '22 00:11



After a suggestion from @TimBiegeleisen I looked at what the function GET() returns. It seems that if the URL exists, GET() returns that same URL in its result, but if the request is redirected, a different URL is returned. I just changed the code to check whether the URL returned by GET() matches the one I submitted.

urls$urlExists <- sapply(1:length(urls[,1]), function(x) ifelse(GET(urls[x, 'urls'])[[1]] == urls[x,'urls'], 1, 0))
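The same comparison can be written a little more defensively, so that one failed request does not stop the run over all 13,000 URLs. This is only a sketch: is_direct_hit() and check_url() are hypothetical names, the 10-second timeout is an arbitrary assumption, and it relies on the response's url field holding the final URL after redirects.

```r
# TRUE when the request returned 200 and ended on the URL that was requested.
# Works on plain vectors, so it can be checked without network access.
is_direct_hit <- function(requested, final, status) {
  !is.na(status) & status == 200L & !is.na(final) & final == requested
}

# Hypothetical wrapper: needs the httr package and network access.
check_url <- function(u, timeout_sec = 10) {
  resp <- tryCatch(
    httr::GET(u, httr::timeout(timeout_sec)),
    error = function(e) NULL  # connection errors and timeouts
  )
  if (is.null(resp)) return(FALSE)
  is_direct_hit(u, resp$url, httr::status_code(resp))
}

# Example call (not run here):
# urls$urlExists <- vapply(urls$urls, check_url, logical(1), USE.NAMES = FALSE)
```

A caveat with any exact string comparison: if the server or client normalizes the URL (for example by adding a trailing slash), an existing page would be reported as missing, so it is worth spot-checking a few known-good URLs first.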

I would be interested in learning about any better methods that people use for the same thing.

Adam answered Nov 06 '22 00:11
