
How to refresh or retry a specific web page using httr GET command?

Tags: r, timeout, get, httr

I need to access the same web page with different "keys" to get specific content it provides.

I have a list of keys, x, and I use the GET function from the httr package to access the web page and then retrieve the information I need into y.

library(httr)
library(stringr)
library(XML)

for (i in 1:20) {
    h1 <- GET(paste0("http:....categories=&query=", x[i]), timeout(10))
    par <- htmlParse(file = h1)

    y[i] <- xpathSApply(doc = par, path = "//h3/a", fun = xmlValue)
}

The problem is that timeout is often reached, and it disrupts the loop.

So I would like to refresh the web page or retry the GET command if timeout is reached, because I suspect the problem is with the internet connection of the website I am trying to access.

The way my code works, timeout breaks the loop. I need to either ignore the error and go to next iteration or retry to access the website.

Felipe Alvarenga, asked May 21 '16 20:05


2 Answers

Look at purrr::safely(). You can wrap GET as such:

safe_GET <- purrr::safely(GET)

This removes the ugliness of tryCatch() by letting you do:

resp <- safe_GET("http://example.com") # you can use all legal `GET` params

And you can test resp$result for NULL. Put that into your retry loop and you're good to go.

You can see this in action by doing:

str(safe_GET("https://httpbin.org/delay/3", timeout(1)))

which will ask the httpbin service to wait 3s before responding but set an explicit timeout on the GET request to 1s. I wrapped it in str() to show the result:

List of 2
 $ result: NULL
 $ error :List of 2
  ..$ message: chr "Timeout was reached"
  ..$ call   : language curl::curl_fetch_memory(url, handle = handle)
  ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

So, you can even check the message if you need to.
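
The retry loop mentioned above could look something like the sketch below. The names get_with_retry, max_tries, wait and fetch are hypothetical, introduced here for illustration; they are not part of httr or purrr. The fetch argument is there only so the loop can be tried out without a network connection.

```r
library(httr)
library(purrr)

safe_GET <- purrr::safely(httr::GET)

# Hypothetical helper (not part of httr or purrr): call `fetch` until it
# returns a result or `max_tries` attempts have been used up.
get_with_retry <- function(url, ..., max_tries = 3, wait = 1, fetch = safe_GET) {
  for (attempt in seq_len(max_tries)) {
    res <- fetch(url, ...)
    if (!is.null(res$result)) {
      return(res$result)  # success: hand back the response object
    }
    # res$error is the condition captured by safely()
    message("Attempt ", attempt, " failed: ", conditionMessage(res$error))
    if (attempt < max_tries) Sys.sleep(wait)  # pause before the next try
  }
  NULL  # every attempt failed
}
```

Inside the loop from the question this would be called as h1 <- get_with_retry(paste0("http:....categories=&query=", x[i]), timeout(10)), followed by if (is.null(h1)) next to skip a key that never answered. Note that a response with an error status (such as a 404) still counts as a result here, because only R-level errors like a timeout leave result as NULL.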

hrbrmstr, answered Oct 06 '22 00:10


http_status(h1) can help you find where the problem lies when the server does return a response (a timeout raises an error before any response exists):

a <- http_status(GET("http://google.com"))
a

$category
[1] "Success"

$reason
[1] "OK"

$message
[1] "Success: (200) OK"

and

b <- http_status(GET("http://google.com/blablablablaba"))
b

$category
[1] "Client error"

$reason
[1] "Not Found"

$message
[1] "Client error: (404) Not Found"

See this list of HTTP status codes to know what the code you get means.

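Moreover, tryCatch can help you achieve what you want. Wrap the request itself, so that a timeout yields NULL instead of stopping the loop (url below is a placeholder for the address you are requesting):

# url is a placeholder for the request URL, not a variable from the question
h1 <- tryCatch(GET(url, timeout(10)), error = function(e) {print("error"); NULL})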
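
To ignore a failed request and move on to the next key, the same idea can be applied per iteration. This is only a sketch: collect and fetch_one are hypothetical names, and fetch_one stands in for whatever GET-and-parse step you run for a single key.

```r
# Hypothetical helper: apply `fetch_one` to every key, recording NA for any
# key whose request raises an error (for example a timeout).
collect <- function(keys, fetch_one) {
  y <- rep(NA_character_, length(keys))
  for (i in seq_along(keys)) {
    y[i] <- tryCatch(
      fetch_one(keys[i]),
      error = function(e) {
        # report the problem and fall through to the next key
        message("Skipping key ", keys[i], ": ", conditionMessage(e))
        NA_character_
      }
    )
  }
  y
}
```

The keys that failed are then easy to find afterwards with x[is.na(y)], so they can be requested again in a second pass.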
François M., answered Oct 05 '22 23:10