I'm getting stuck on cookies when trying to download a PDF.
For example, if I have a DOI for a PDF document on the Archaeology Data Service, it resolves to a landing page that contains an embedded link to the PDF, but that link really redirects to another link.
library(httr) will handle resolving the DOI, and we can extract the PDF URL from the landing page using library(XML), but I'm stuck at getting the PDF itself.
If I do this:
download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")
then I receive an HTML file that is the same as http://archaeologydataservice.ac.uk/myads/
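A quick way to confirm that a download produced an HTML page rather than a PDF is to look at the first bytes of the file, since PDF files start with the marker %PDF. This is only a minimal sketch in base R; the helper name is_pdf and the temporary files are made up for illustration.

```r
# Hypothetical helper: TRUE if the file starts with the PDF marker "%PDF"
is_pdf <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 4)
  identical(magic, charToRaw("%PDF"))
}

# Illustration with two throwaway files (not real downloads)
fake_html <- tempfile(fileext = ".pdf")
writeLines("<?xml version='1.0' encoding='UTF-8' ?>", fake_html)

fake_pdf <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4\n"), fake_pdf)

is_pdf(fake_html)  # FALSE: an HTML page saved under a .pdf name
is_pdf(fake_pdf)   # TRUE
```

Running is_pdf("tmp.pdf") after the download.file call above would show whether the cookie page was saved instead of the document.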
Trying the answer at How to use R to download a zipped file from a SSL page that requires cookies leads me to this:
library(httr)
terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")
# Accept the terms on the form,
# generating the appropriate cookies
POST(terms, body = values)
GET(download, query = values)
# Actually download the file (this will take a while)
resp <- GET(download, query = values)
# write the content of the download to a binary file
writeBin(content(resp, "raw"), "c:/temp/thefile.zip")
But after the POST and GET functions I simply get the HTML of the same cookie page that I got with download.file:
> GET(download, query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
Date: 2016-01-06 00:35
Status: 200
Content-Type: text/html;charset=UTF-8
Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; c...
<title>Archaeology Data Service: myADS</title>
<link href="http://archaeologydataservice.ac.uk/css/u...
...
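The long from= value in the response URL above is a hex-encoded copy of the path and query that was originally requested, which shows that the server parked the download request and redirected to the copyright page instead. A small base R sketch for decoding it follows; the function name hex_to_string is made up here, and only the first part of the value is decoded to keep the example short.

```r
# Hypothetical helper: turn a string of hex digit pairs back into text
hex_to_string <- function(hex) {
  # split into two-character chunks, one chunk per byte
  starts <- seq(1, nchar(hex), by = 2)
  bytes <- substring(hex, starts, starts + 1)
  rawToChar(as.raw(strtoi(bytes, base = 16L)))
}

# First 26 bytes of the from= parameter shown in the response above
from_prefix <- "2f6172636869766544532f61726368697665446f776e6c6f6164"
hex_to_string(from_prefix)
# [1] "/archiveDS/archiveDownload"
```

Decoding the full value gives the same path followed by the agree and t query parameters, with the slashes in t percent-encoded.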
Looking at http://archaeologydataservice.ac.uk/about/Cookies it seems that the cookie situation at this site is complicated. This kind of cookie complexity seems not to be unusual for UK data providers; see: automating the login to the uk data service website in R with RCurl or httr
How can I use R to get past the cookies on this website?
Your plea on rOpenSci has been heard!
There's lots of javascript between those pages that makes it somewhat annoying to try to decipher via httr + rvest. Try RSelenium. This worked on OS X 10.11.2, R 3.2.3 & Firefox loaded.
library(RSelenium)
# check if a server is present, if not, get a server
checkForServer()
# get the server going
startServer()
dir.create("~/justcreateddir")
setwd("~/justcreateddir")
# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
`browser.download.folderList` = as.integer(2),
`browser.download.dir` = getwd(),
`pdfjs.disabled` = TRUE,
`plugin.scan.plid.all` = FALSE,
`plugin.scan.Acrobat` = "99.0",
`browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()
# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")
# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))
# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))
Now wait for the download to complete. The R console will not be busy while it downloads, so it is easy to close the session accidentally before the download has completed.
# close the session
dr$close()
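The wait before closing the session can be automated by polling the download directory. The sketch below is not part of RSelenium; wait_for_download is a made-up helper, and it assumes that Firefox keeps a temporary .part file next to the target while a download is in progress and removes it when the download finishes.

```r
# Hypothetical helper: block until a matching file exists in `dir`
# and no Firefox ".part" file is left, or stop after `timeout` seconds
wait_for_download <- function(dir, pattern = "\\.pdf$", timeout = 120, interval = 1) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    done <- list.files(dir, pattern = pattern, full.names = TRUE)
    partial <- list.files(dir, pattern = "\\.part$")
    if (length(done) > 0 && length(partial) == 0) {
      return(done)
    }
    Sys.sleep(interval)
  }
  stop("download did not finish within ", timeout, " seconds")
}

# Intended use with the session above (commented out, needs a live browser):
# wait_for_download(getwd())
# dr$close()
```

Polling the directory keeps the script independent of the browser; the timeout is there so that a failed download does not hang the R session forever.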
This answer came from John Harrison by email, posted here at his request:
This will allow you to download the PDF:
appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl <- getCurlHandle()
curlSetOpt(cookiefile = "cookies.txt", curl = curl, followLocation = TRUE)
pdfData <- getBinaryURL(appURL, curl = curl, .opts = list(cookie = "ADSCOPYRIGHT=YES"))
writeBin(pdfData, "test2.pdf")
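Since the question started from httr, the same idea can probably be expressed there as well. The following is an untested sketch, not John Harrison's code: download_with_cookie is a made-up name, and it assumes the server still accepts a preset ADSCOPYRIGHT=YES cookie and reports the file as application/pdf. It uses httr's set_cookies(), write_disk(), stop_for_status() and http_type().

```r
# Hypothetical helper: fetch `url` with the copyright cookie already set
download_with_cookie <- function(url, destfile) {
  resp <- httr::GET(
    url,
    httr::set_cookies(ADSCOPYRIGHT = "YES"),       # pretend ACCEPT was clicked
    httr::write_disk(destfile, overwrite = TRUE)   # save the body to a file
  )
  httr::stop_for_status(resp)
  # Assumption: a successful download is served as application/pdf
  if (!identical(httr::http_type(resp), "application/pdf")) {
    warning("response is ", httr::http_type(resp), ", not a PDF")
  }
  invisible(resp)
}

# Intended use (commented out, needs network access):
# download_with_cookie(appURL, "test3.pdf")
```

httr follows redirects by default, so no equivalent of followLocation = TRUE is needed in this sketch.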
Here's a longer version showing his working:
appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl <- getCurlHandle()
curlSetOpt(cookiefile = "cookies.txt", curl = curl, followLocation = TRUE)
appData <- getURL(appURL, curl = curl)
# get the necessary elements for the POST that is initiated when the ACCEPT button is pressed
doc <- htmlParse(appData)
appAttrs <- doc["//input", fun = xmlAttrs]
postData <- lapply(appAttrs, function(x) {
  data.frame(name = x[["name"]], value = x[["value"]], stringsAsFactors = FALSE)
})
postData <- do.call(rbind, postData)
# post your acceptance
postURL <- "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid="
# get jsessionid
jsessionid <- unlist(strsplit(getCurlInfo(curl)$cookielist[1], "\t"))[7]
searchData <- postForm(paste0(postURL, jsessionid), curl = curl,
"j_id10" = "j_id10",
from = postData[postData$name == "from", "value"],
"javax.faces.ViewState" = postData[postData$name == "javax.faces.ViewState", "value"],
"j_id10:_idcl" = "j_id10:agreeButton",
binary = TRUE
)
con <- file("test.pdf", open = "wb")
writeBin(searchData, con)
close(con)
Pressing the ACCEPT button on the page you gave initiates a POST to "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid=......" via some javascript.
This POST then redirects to the page with the PDF, having set some additional cookies.
Checking our cookies we see:
> getCurlInfo(curl)$cookielist
[1] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tJSESSIONID\t3d249e3d7c98ec35998e69e15d3e"
[2] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tSSOSESSIONID\t3d249e3d7c98ec35998e69e15d3e"
[3] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tADSCOPYRIGHT\tYES"
So it would probably be sufficient to set that last cookie (ADSCOPYRIGHT=YES, indicating we accept the copyright terms) from the start, which is what the short version above does.
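The longer version picks the session id by position (first cookie, seventh field), which depends on the order in which the cookies arrive. As a small sketch in base R, the tab-separated entries printed above can be split into a table and looked up by name instead. The function name parse_cookielist and the column names are made up here; the column order follows the Netscape cookie file layout that libcurl uses.

```r
# The three entries shown in the output above
cookielist <- c(
  "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tJSESSIONID\t3d249e3d7c98ec35998e69e15d3e",
  "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tSSOSESSIONID\t3d249e3d7c98ec35998e69e15d3e",
  "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tADSCOPYRIGHT\tYES"
)

# Hypothetical helper: one row per cookie, one column per field
parse_cookielist <- function(x) {
  fields <- strsplit(x, "\t", fixed = TRUE)
  cookies <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(cookies) <- c("domain", "subdomains", "path", "secure",
                      "expires", "name", "value")
  cookies
}

cookies <- parse_cookielist(cookielist)
cookies$value[cookies$name == "JSESSIONID"]
# [1] "3d249e3d7c98ec35998e69e15d3e"
cookies$value[cookies$name == "ADSCOPYRIGHT"]
# [1] "YES"
```

With a live handle, parse_cookielist(getCurlInfo(curl)$cookielist) would give the same table for the current session.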