
rvest Error in open.connection(x, "rb") : Timeout was reached

Tags: r, rvest

I'm trying to scrape the content of http://google.com, but I get the error message below.

library(rvest)  
html("http://google.com")

Error in open.connection(x, "rb") : Timeout was reached
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")

Since I'm on a company network, this may be caused by a firewall or proxy. I tried using set_config, but it did not help.

user3267649 asked Oct 23 '15 05:10


3 Answers

I encountered the same Error in open.connection(x, "rb") : Timeout was reached issue when working behind a proxy on the office network.

Here's what worked for me:

library(rvest)
url <- "http://google.com"
# Download the page to a local file first, then parse that file
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")

Credit : https://stackoverflow.com/a/38463559
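
A slightly more defensive variant of the same workaround, as a minimal sketch. The helper name read_html_via_file and its timeout_sec argument are my own (hypothetical), not part of rvest. It downloads to a temporary file instead of the working directory and raises R's "timeout" option for the duration of the download. I have not verified why download.file gets through where read_html on a URL does not, so treat this as something to try rather than a guaranteed fix.

```r
library(rvest)

# Hypothetical helper: download the page to a temporary file, then parse it.
read_html_via_file <- function(url, timeout_sec = 60) {
  # download.file() honours the "timeout" option (in seconds);
  # restore the previous value when the function exits.
  old <- options(timeout = timeout_sec)
  on.exit(options(old), add = TRUE)

  tmp <- tempfile(fileext = ".html")
  download.file(url, destfile = tmp, quiet = TRUE, mode = "wb")

  # Parse the local copy instead of opening a connection to the URL.
  read_html(tmp)
}
```

Usage would then be page <- read_html_via_file("http://google.com"), with a larger timeout_sec if the network is slow.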

user799188 answered Nov 20 '22 16:11


This is probably because your call to read_html (or html, in your case) does not properly identify itself to the server it is trying to retrieve content from, which is the default behaviour. Using curl, open the connection yourself and set a user agent through the handle argument of curl(), then pass that connection to read_html so that your scraper identifies itself.

library(rvest)
library(curl)
read_html(curl('http://google.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
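
If you would rather go through httr (which is what set_config applies to), the same idea can be sketched as below. This is only a sketch under assumptions: scrape_config and fetch_page are hypothetical helper names, and the proxy host and port in the usage note are placeholders you would replace with your company's values. It sets a user agent and a timeout explicitly, optionally adds a proxy, and hands the downloaded bytes to read_html.

```r
library(rvest)
library(httr)

# Hypothetical helper: build the httr config for one request.
scrape_config <- function(ua = "Mozilla/5.0", timeout_sec = 30,
                          proxy_host = NULL, proxy_port = NULL) {
  cfg <- c(user_agent(ua), timeout(timeout_sec))
  if (!is.null(proxy_host)) {
    # Assumption: the proxy needs no authentication.
    cfg <- c(cfg, use_proxy(proxy_host, proxy_port))
  }
  cfg
}

# Hypothetical helper: fetch a page with that config and parse it.
fetch_page <- function(url, ...) {
  resp <- GET(url, scrape_config(...))
  stop_for_status(resp)                 # fail loudly on HTTP errors
  read_html(content(resp, as = "raw"))  # parse the raw response body
}
```

Usage would look like fetch_page("http://google.com", proxy_host = "proxy.example.com", proxy_port = 8080), where the host and port are made up for illustration.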

genericgreatape answered Nov 20 '22 16:11


I ran into this issue because my VPN was switched on. After turning it off and retrying, the error went away.

Brent Brewington answered Nov 20 '22 17:11