To query the NIST Chemistry WebBook database from R, I need to pass a query as part of a URL-encoded web address. Most of the time URLencode() handles the conversion fine, but in some cases it does not. One case where it fails is
query="Poligodial + 3-methoxy-4,5-methylenedioxyamphetamine (R,S) adduct, # 1"
which I tried to fetch using
library(XML)
library(RCurl)
url=URLencode(paste0('http://webbook.nist.gov/cgi/cbook.cgi?Name=',query,'&Units=SI'))
doc=htmlParse(getURL(url),encoding="UTF-8")
However, if you open the resulting URL in a web browser, http://webbook.nist.gov/cgi/cbook.cgi?Name=Poligodial%20+%203-methoxy-4,5-methylenedioxyamphetamine%20(R,S)%20adduct,%20%23%201&Units=SI, the site reports that the name was not found. If you submit the same query through the search form at http://webbook.nist.gov/chemistry/name-ser.html, it turns out the site expects the URL-encoded string
"http://webbook.nist.gov/cgi/cbook.cgi?Name=Poligodial+%2B+3-methoxy-4%2C5-methylenedioxyamphetamine+%28R%2CS%29+adduct%2C+%23+1&Units=SI"
Does anybody have any idea what kind of gsub rules I should use to arrive at the same kind of URL encoding in this case? Or is there some other easy fix?
I tried with
url=gsub(" ","+",gsub(",","%2C",gsub("+","%2B",URLencode(paste('http://webbook.nist.gov/cgi/cbook.cgi?Name=',query,'&Units=SI', sep="")),fixed=T),fixed=T),fixed=T)
but that still wasn't quite right, and I have no idea what rules the owner of the web site could have used...
URLencode follows the RFC1738 specification (see section 2.2, page 3), which states that:
only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.
That is, it doesn't encode plus signs, commas, or parentheses. So the URL it generates is correct in theory but not in practice.
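As an aside, URLencode also takes a reserved argument. The sketch below assumes a recent R release (older releases used a different set of "safe" characters, so the result may differ there) and applies both settings to the query value alone rather than to the whole URL:

```r
query <- "Poligodial + 3-methoxy-4,5-methylenedioxyamphetamine (R,S) adduct, # 1"

# Default (reserved = FALSE): punctuation such as "+", "," and "(" is left as-is.
default_enc <- URLencode(query)

# reserved = TRUE: in recent R versions the reserved characters are
# percent-encoded as well, which is what a query-string value needs.
strict_enc <- URLencode(query, reserved = TRUE)

print(default_enc)
print(strict_enc)
```

Encoding only the value, and not the full address, matters here: with reserved = TRUE the "?", "=" and "&" that structure the URL would otherwise be escaped too.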
The GET function in the httr package that Scott mentioned calls curlEscape from RCurl, which encodes these punctuation characters.
(GET calls handle_url which calls modify_url which calls build_url which calls curlEscape.)
The URL it generates is
paste0('http://webbook.nist.gov/cgi/cbook.cgi?Name=', curlEscape(query), '&Units=SI')
## [1] "http://webbook.nist.gov/cgi/cbook.cgi?Name=Poligodial%20%2B%203%2Dmethoxy%2D4%2C5%2Dmethylenedioxyamphetamine%20%28R%2CS%29%20adduct%2C%20%23%201&Units=SI"
This seems to work OK.
httr has nice features and you may want to start using it. The minimal change to your code to get things working is simply to swap URLencode for curlEscape.
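If you would rather reproduce the exact form-style string from the question (spaces as "+", other punctuation percent-encoded) without extra packages, something like the following base-R sketch might do. Note that form_encode is a hypothetical helper name, not a function from any package, and the set of characters it leaves alone (letters, digits, ".", "_", "~", "-") is an assumption inferred from the URL shown in the question, not from anything NIST documents.

```r
# Hypothetical helper: encode a single query-string value the way an HTML
# form submission typically does. Not part of base R or any package.
form_encode <- function(x) {
  chars <- strsplit(enc2utf8(x), "")[[1]]
  encoded <- vapply(chars, function(ch) {
    if (grepl("^[A-Za-z0-9._~-]$", ch)) return(ch)   # assumed safe set: keep
    if (ch == " ") return("+")                       # space becomes "+"
    # everything else: percent-encode each byte of the character
    paste0("%", toupper(as.character(charToRaw(ch))), collapse = "")
  }, character(1), USE.NAMES = FALSE)
  paste(encoded, collapse = "")
}

query <- "Poligodial + 3-methoxy-4,5-methylenedioxyamphetamine (R,S) adduct, # 1"
url <- paste0("http://webbook.nist.gov/cgi/cbook.cgi?Name=",
              form_encode(query), "&Units=SI")
print(url)
```

Only the value of Name is passed through the helper, so the "?", "=" and "&" that delimit the query string stay intact.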
Does this do what you want?
library(httr)
url <- 'http://webbook.nist.gov/cgi/cbook.cgi'
args <- list(Name = "Poligodial + 3-methoxy-4,5-methylenedioxyamphetamine (R,S) adduct, # 1",
Units = 'SI')
res <- GET(url, query=args)
content(res)$children$html
Gives
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta http-equiv="Window-target" content="_top"/>
...etc.