How do I close unused connections after read_html in R

Tags: r, rvest, webchem

I am quite new to R and am trying to access some information on the internet, but am having problems with connections that don't seem to be closing. I would really appreciate it if someone here could give me some advice...

Originally I wanted to use the webchem package, which in theory delivers everything I need, but when some of the output data is missing from a webpage, webchem returns no data from that page at all. To get around this, I took most of the code from the package and altered it slightly to fit my needs. This worked fine for roughly the first 150 queries, but now, although I have changed nothing, read_html produces the warning message "closing unused connection 4 (http:....". Although this is only a warning, read_html returns nothing once the warning has been generated.

I have written simplified code, given below, which shows the same problem.

Closing R completely (or even rebooting my PC) doesn't seem to make a difference: the warning message now appears the second time I use the code. I can run the queries one at a time outside the loop with no problems, but as soon as I use the loop, the error occurs again on the second iteration. I have tried to vectorise the code, and it returned the same error message.

I tried showConnections(all = TRUE), but only got connections 0-2 for stdin, stdout and stderr. I have searched for ways to close the html connection, but I can't define the url as a con, and close(qurl) and close(ttt) also don't work (they return the errors "no applicable method for 'close' applied to an object of class "character"" and "no applicable method for 'close' applied to an object of class "c('xml_document', 'xml_node')"", respectively).

Does anybody know a way to close these connections so that they don't break my routine? Any suggestions would be very welcome. Thanks!

PS: I am using R version 3.3.0 with RStudio Version 0.99.902.

library(rvest)   # read_html()
library(xml2)    # xml_find_all(), xml_text()

CasNrs <- c("630-08-0", "463-49-0", "194-59-2", "86-74-8", "148-79-8")
tit <- character()
for (i in 1:length(CasNrs)) {
  CurrCasNr <- as.character(CasNrs[i])
  baseurl <- 'http://chem.sis.nlm.nih.gov/chemidplus/rn/'
  qurl <- paste0(baseurl, CurrCasNr, '?DT_START_ROW=0&DT_ROWS_PER_PAGE=50')
  ttt <- try(read_html(qurl), silent = TRUE)              # warns "closing unused connection" from the 2nd iteration on
  tit[i] <- xml_text(xml_find_all(ttt, "//head/title"))   # grab the page title
}
asked Jun 15 '16 by user6469960

2 Answers

After researching the topic I came up with the following solution:

url <- "https://website_example.com"
url <- url(url, "rb")     # open an explicit read connection to the page
html <- read_html(url)    # parse the page from that connection
close(url)                # then close the connection yourself

# + whatever you want to do with the html, since it's already saved!
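For context, here is a sketch of how this pattern could be dropped into the loop from the question (it reuses CasNrs, the ChemIDplus URL and the try() error handling from there; xml_find_first is used instead of xml_find_all just to make sure a single title string comes back):

library(rvest)   # read_html()
library(xml2)    # xml_find_first(), xml_text()

CasNrs  <- c("630-08-0", "463-49-0", "194-59-2", "86-74-8", "148-79-8")
baseurl <- "http://chem.sis.nlm.nih.gov/chemidplus/rn/"
tit     <- character(length(CasNrs))

for (i in seq_along(CasNrs)) {
  qurl <- paste0(baseurl, CasNrs[i], "?DT_START_ROW=0&DT_ROWS_PER_PAGE=50")
  con  <- url(qurl, "rb")                    # open the connection explicitly
  ttt  <- try(read_html(con), silent = TRUE)
  close(con)                                 # close it as soon as the page is parsed
  if (!inherits(ttt, "try-error")) {
    tit[i] <- xml_text(xml_find_first(ttt, "//head/title"))
  }
}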
answered Sep 23 '22 by Roberto de la Iglesia


I haven't found a good answer for this problem. The best work-around that I came up with is to include the function below, with Secs = 3 or 4. I still don't know why the problem occurs or how to stop it without building in a large delay.

CatchupPause <- function(Secs){
  Sys.sleep(Secs)          # pause to let the connection finish its work
  closeAllConnections()    # then drop any connections still left open
  gc()                     # and free the associated memory
}
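For completeness, a minimal sketch of how this helper might be called between requests in the loop from the question (CasNrs, baseurl and tit as defined there; the 3-second pause is just the value suggested above):

for (i in seq_along(CasNrs)) {
  qurl <- paste0(baseurl, CasNrs[i], "?DT_START_ROW=0&DT_ROWS_PER_PAGE=50")
  ttt  <- try(read_html(qurl), silent = TRUE)
  if (!inherits(ttt, "try-error")) {
    tit[i] <- xml_text(xml_find_first(ttt, "//head/title"))
  }
  CatchupPause(3)   # pause, close any stray connections, run gc()
}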
answered Sep 20 '22 by nm200