 

how to set timeout in rvest

Tags: r, timeout, rvest

Simple question: the call x <- read_html(url) sometimes hangs, reading the page for an indefinite amount of time. I don't know how to handle this, for example by setting a maximum time to wait for a response. I could use try, catch, or whatever to retry, but the call just hangs and nothing happens. Does anyone know how to deal with this?

There's no problem with the page itself; the hang only occurs sometimes, and when I retry manually it works.

Peter.k asked Feb 04 '23


1 Answer

You can wrap read_html in the GET() function from the httr package, which accepts a timeout() setting.

e.g. if your original code was

library(rvest)
library(dplyr)

my_url <- "https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest"
x <- my_url %>% read_html(.)

then you could replace it with

library(httr)

# Allow 10 seconds
my_url %>% GET(., timeout(10)) %>% read_html

# Allow 30 seconds
my_url %>% GET(., timeout(30)) %>% read_html
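Note that GET() returns an httr response object rather than raw HTML; read_html() can parse it directly, as in the pipes above. If you also want the request to fail loudly on HTTP errors rather than only on timeouts, here is a minimal sketch using httr's stop_for_status() (variable names are illustrative):

resp <- GET(my_url, timeout(10))
stop_for_status(resp)  # raise an R error on HTTP 4xx/5xx responses
x <- read_html(resp)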

Example

To put it to the test, try setting an extremely short timeout period (e.g. a hundredth of a second):

# Allow an unreasonably short amount of time so the request errors rather than hangs indefinitely

my_url %>% GET(., timeout(0.01)) %>% read_html

# Error in curl::curl_fetch_memory(url, handle = handle) : 
#   Timeout was reached: Resolving timed out after 10 milliseconds


Using it in a loop (e.g. skip to the next url if one times out)

Try running this code. It supposes you have a number of urls to visit (three in this case; the second url below will delay 3 seconds before providing its html, which is a great way to test the functionality you're looking for). We set the timeout to 2 seconds, so we know the second request will fail. tryCatch() evaluates the expression given as its first argument; if that expression throws an error, the handler passed as the error argument runs instead, and here it simply assigns 'Timed out!' to the list element.


my_urls <- c("https://stackoverflow.com/questions/48722076/how-to-set-timeout-in-rvest",
             "http://httpbin.org/delay/3", # This url will delay 3 seconds
             "http://httpbin.org/delay/1") 

x <- list()

# Set the timeout to 2 seconds (so the second url will fail)
for (i in seq_along(my_urls)) {

  print(paste0("Scraping url number ", i))

  # The error handler runs in its own environment, so it needs <<-
  # to assign into the enclosing one
  tryCatch(x[[i]] <- my_urls[i] %>% GET(., timeout(2)) %>% read_html,
           error = function(e) { x[[i]] <<- "Timed out!" })

}

Now we inspect the output: the first and third sites returned content; the second timed out.

# > x
# [[1]]
# {xml_document}
# <html itemscope="" itemtype="http://schema.org/QAPage" class="html__responsive">
#   [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>r - how to set timeout ...
# [2] <body class="question-page unified-theme">\r\n    <div id="notify-container"></div>\r\n    <div id="custom ...
# 
# [[2]]
# [1] "Timed out!"
# 
# [[3]]
# {xml_document}
# <html>
# [1] <body><p>{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {}, \n  "headers": {\n    "Accept": ...


You can set the timeout value to whatever you want; 30-60 seconds could be sensible depending on the use case.
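Since the question mentions retrying, it is also worth noting that httr provides RETRY(), which re-attempts a failed request (including one that timed out) up to a set number of times, pausing between attempts. A minimal sketch along the same lines as above (the specific values are illustrative):

# Try up to 3 times, each attempt limited to 10 seconds
x <- RETRY("GET", my_url, timeout(10), times = 3) %>% read_html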

stevec answered Feb 08 '23