Simple question: this code
x <- read_html(url)
hangs indefinitely while reading the page. I don't know how to handle this, for example by setting a maximum time for the response. I could use try, catch, or whatever to retry, but it just hangs and nothing happens. Does anyone know how to deal with it?
There's no problem with the page itself. It only happens sometimes, and when I retry manually it works.
You can fetch the page with the GET function from the httr package, which accepts a timeout, and pipe the response into read_html. For example, if your original code was
library(rvest)
library(dplyr)
my_url <- "https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest"
x <- my_url %>% read_html(.)
then you could replace it with
library(httr)
# Allow 10 seconds
my_url %>% GET(., timeout(10)) %>% read_html
# Allow 30 seconds
my_url %>% GET(., timeout(30)) %>% read_html
To put it to the test, try setting an extremely short timeout period (e.g. a hundredth of a second)
# Allow an unreasonably short amount of time so the request errors rather than hangs indefinitely
my_url %>% GET(., timeout(0.01)) %>% read_html
# Error in curl::curl_fetch_memory(url, handle = handle) :
# Timeout was reached: Resolving timed out after 10 milliseconds
You can find some more examples here
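Since the question also mentions retrying, here is a minimal sketch of how the GET(., timeout(...)) pattern above could be combined with a retry loop. The helper name read_html_with_retry and its arguments (timeout_sec, max_tries, wait_sec, fetch) are made up for illustration; they are not part of rvest or httr. The fetch argument exists only so the retry logic can be tried without a network connection.

```r
# Hypothetical helper (not part of rvest or httr): retry a timed-out fetch.
# Assumes the httr and rvest packages are installed when the default `fetch` is used.
read_html_with_retry <- function(url,
                                 timeout_sec = 10,
                                 max_tries = 3,
                                 wait_sec = 1,
                                 fetch = function(u, secs) {
                                   # Same pattern as above: GET with a timeout, then parse
                                   rvest::read_html(httr::GET(u, httr::timeout(secs)))
                                 }) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      fetch(url, timeout_sec),
      error = function(e) {
        # Any error (a timeout included) ends up here; report it and signal failure
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)
    # Pause briefly before the next attempt, but not after the last one
    if (attempt < max_tries) Sys.sleep(wait_sec)
  }
  stop("Giving up on ", url, " after ", max_tries, " attempts")
}

# Example call (needs network access, so it is left commented out):
# x <- read_html_with_retry(my_url, timeout_sec = 10, max_tries = 3)
```

Note that this retries on any error, not only on timeouts. httr also exports a RETRY() function that may cover the same need; check its documentation for the exact arguments before relying on it.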
Try running this code. It assumes you have several URLs to visit (3 in this case). The second URL below waits 3 seconds before returning its HTML, which is a great way to test the functionality you're looking for. We set the timeout to 2 seconds, so we know that request will fail. If the first expression passed to tryCatch() throws an error, the function supplied as its error argument runs instead; in this case it simply assigns 'Timed out!' to the list element
my_urls <- c("https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest",
"http://httpbin.org/delay/3", # This url will delay 3 seconds
"http://httpbin.org/delay/1")
x <- list()
# Set timeout for 2 seconds (so second url will fail)
for (i in seq_along(my_urls)) {
  print(paste0("Scraping url number ", i))
  # On any error (here, the timeout) store a marker instead of the parsed page
  tryCatch(x[[i]] <- my_urls[i] %>% GET(., timeout(2)) %>% read_html,
           error = function(e) { x[[i]] <<- "Timed out!" } )
}
Now we inspect the output - the first and third sites returned content, the second timed out
# > x
# [[1]]
# {xml_document}
# <html itemscope="" itemtype="http://schema.org/QAPage" class="html__responsive">
# [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>r - how to set timeout ...
# [2] <body class="question-page unified-theme">\r\n <div id="notify-container"></div>\r\n <div id="custom ...
#
# [[2]]
# [1] "Timed out!"
#
# [[3]]
# {xml_document}
# <html>
# [1] <body><p>{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {}, \n "headers": {\n "Accept": ...
Obviously you can set the timeout value to whatever you want. 30 to 60 seconds could be sensible, depending on the use case.
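For what it's worth, the same loop can be sketched without the <<- assignment by letting tryCatch() return the fallback value directly. The names fetch_or_na and scrape_all are hypothetical, as is the choice of NA as the failure marker. It is used here instead of 'Timed out!' because the handler catches every error, not just timeouts. The fetch argument is only there so the logic can be tried offline.

```r
# Hypothetical helper: return the parsed page, or NA if the request errors.
# Assumes the httr and rvest packages are installed when the default `fetch` is used.
fetch_or_na <- function(url,
                        timeout_sec = 2,
                        fetch = function(u, secs) {
                          rvest::read_html(httr::GET(u, httr::timeout(secs)))
                        }) {
  # tryCatch() returns the handler's value on error, so no <<- is needed
  tryCatch(fetch(url, timeout_sec), error = function(e) NA)
}

# Hypothetical wrapper: one result per URL, named by URL
scrape_all <- function(urls, ...) {
  results <- lapply(urls, fetch_or_na, ...)
  names(results) <- urls
  results
}

# Example call (needs network access, so it is left commented out):
# x <- scrape_all(my_urls, timeout_sec = 2)
# failed <- vapply(x, function(p) identical(p, NA), logical(1))
```

The failed vector in the commented example would flag which URLs need another attempt, possibly with a longer timeout.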