
Automating the login to the UK Data Service website in R with RCurl or httr

I am in the process of writing a collection of freely-downloadable R scripts for http://asdfree.com/ to help people analyze the complex sample survey data hosted by the UK Data Service. In addition to providing lots of statistics tutorials for these data sets, I also want to automate the download and import of this survey data. To do that, I need to figure out how to programmatically log in to the UK Data Service website.

I have tried lots of different configurations of RCurl and httr to log in, but I'm making a mistake somewhere and I'm stuck. I have tried inspecting the elements as outlined in this post, but the websites jump around too fast in the browser for me to understand what's going on.

This website does require a login and password, but I believe I'm making a mistake before I even get to the login page.

Here's how the website works:

The starting page should be: https://www.esds.ac.uk/secure/UKDSRegister_start.asp

This page will automatically re-direct your web browser to a long URL that starts with: https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah]

(1) For some reason, the SSL certificate does not work on this website. Here's the SO question I posted regarding this. The workaround I've used is simply to skip SSL certificate verification:

library(httr)
set_config( config( ssl.verifypeer = 0L ) )

and then my first command on the starting website is:

z <- GET( "https://www.esds.ac.uk/secure/UKDSRegister_start.asp" )

This gives me back a z$url that looks a lot like the https://wayf.ukfederation.org.uk/DS002/uk.ds?[blahblahblah] page that my browser also redirects to.

In the browser, then, you're supposed to type in "uk data archive" and click the continue button. When I do that, it re-directs me to the web page https://shib.data-archive.ac.uk/idp/Authn/UserPassword

I think this is where I'm stuck, because I cannot figure out how to get cURL to follow the redirect (followlocation) and land on this page. Note: no username/password has been entered yet.

When I use the httr GET command from the wayf.ukfederation.org.uk page like this:

y <- GET( z$url , query = list( combobox = "https://shib.data-archive.ac.uk/shibboleth-idp" ) )

the y$url string looks a lot like z$url (except it's got a combobox= on the end). Is there any way to get through to this uk data archive authentication page with RCurl or httr?

I can't tell if I'm just overlooking something or if I absolutely must use the SSL certificate described in my previous SO post or what?

(2) At the point I do make it through to that page, I believe the remainder of the code would just be:

values <- list( j_username = "your.username" , 
                j_password = "your.password" )
POST( "https://shib.data-archive.ac.uk/idp/Authn/UserPassword" , body = values)

But I guess that page will have to wait...

Anthony Damico asked Jul 22 '13 01:07


2 Answers

The fields that the form actually submits are action and origin, not combobox. Set action to "selection" and origin to the value of the relevant entry in the combobox:

y <- GET( z$url, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
> y$url
[1] "https://shib.data-archive.ac.uk:443/idp/Authn/UserPassword"
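One way to check which field names a form submits is to list the name attributes of its controls. The sketch below uses rvest on a made-up stand-in for the organization-selection page (the real markup is not shown in the question, so the HTML here is an assumption for illustration only). It shows how a control can have the id combobox while being submitted under the name origin.

```r
library(rvest)

# Made-up stand-in for the "enter your organization" page. The real markup
# is not shown in the question, so this HTML is an assumption.
page <- read_html('
  <form method="get">
    <input type="hidden" name="action" value="selection">
    <select id="combobox" name="origin">
      <option value="https://shib.data-archive.ac.uk/shibboleth-idp">UK Data Archive</option>
    </select>
    <input type="submit" value="Continue">
  </form>')

# Browsers submit form controls under their name attribute, not their id.
controls <- html_nodes(page, "form input, form select")
field_names <- html_attr(controls, "name")

# Controls without a name attribute (the submit button here) are not submitted.
field_names <- field_names[!is.na(field_names)]
print(field_names)
```

Running the same three lines against the page you actually get back from the redirect should show which names to put in the query list.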

Edit

It looks as though the handle pool isn't keeping your session alive correctly, so you need to pass the handles explicitly rather than letting httr pick them automatically. For the POST command you also need to set multipart=FALSE, because HTML forms are URL-encoded rather than multipart by default. The R command defaults to multipart because it is mainly designed for uploading files. So:

y <- GET( handle=z$handle, query = list( action="selection", origin = "https://shib.data-archive.ac.uk/shibboleth-idp") )
POST(body=values,multipart=FALSE,handle=y$handle)
Response [https://www.esds.ac.uk/]
  Status: 200
  Content-type: text/html

...snipped...    


                <title>

                        Introduction to ESDS

                </title>

                <meta name="description" content="Introduction to the ESDS, home page" />
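In newer httr releases the multipart argument of POST has been replaced by encode, and curl options are spelled with underscores (ssl_verifypeer rather than ssl.verifypeer). A rough sketch of the same flow in that style follows. The function name login_ukds is hypothetical, the URLs are the ones quoted in the question and may no longer work, and reusing one explicitly created handle across all three hosts is an assumption about how to keep the session cookies together, not something confirmed above.

```r
library(httr)

# Hypothetical helper: a sketch of the login flow, not a tested implementation.
login_ukds <- function(username, password) {
  # One explicit handle for every request, so cookies set along the
  # redirect chain stay in a single session (assumption, see lead-in).
  h <- handle("https://www.esds.ac.uk")

  # Skip certificate verification, as the question does (newer option name).
  insecure <- config(ssl_verifypeer = 0L)

  # Step 1: the starting page, which redirects to the WAYF page.
  start <- GET("https://www.esds.ac.uk/secure/UKDSRegister_start.asp",
               insecure, handle = h)

  # Step 2: select the UK Data Archive as the organization.
  idp <- GET(start$url, insecure, handle = h,
             query = list(action = "selection",
                          origin = "https://shib.data-archive.ac.uk/shibboleth-idp"))

  # Step 3: send the credentials URL-encoded, like a plain HTML form.
  POST(idp$url, insecure, handle = h,
       body = list(j_username = username, j_password = password),
       encode = "form")
}
```

Calling login_ukds("your.username", "your.password") would return the final response, whose url and status_code can be checked to see where the login landed.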
James answered Nov 05 '22 10:11


I think one way to get past the "enter your organization" page goes like this:

library(tidyverse)
library(rvest)
library(stringr)
library(httr)  # for handle_reset(), cookies() and set_cookies()

org <- "your_organization"
user <- "your_username"
password <- "your_password"

signin <- "http://esds.ac.uk/newRegistration/newLogin.asp"
handle_reset(signin)

# get to org page and enter org
p0 <- html_session(signin) %>% 
    follow_link("Login")
org_link <- html_nodes(p0, "option") %>% 
    str_subset(org) %>% 
    str_match('(?<=\\")[^"]*') %>%
    as.character()

f0 <- html_form(p0) %>%
    first() %>%
    set_values(origin = org_link)
fake_submit_button <- list(name = "submit-btn",
                           type = "submit",
                           value = "Continue",
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "btn-enabled"
f0[["fields"]][["submit"]] <- fake_submit_button

c0 <- cookies(p0)$value
names(c0) <- cookies(p0)$name
p1 <- submit_form(session = p0, form = f0, config = set_cookies(.cookies = c0))

Unfortunately, that doesn't solve the whole problem: (2) is harder than it looks. I've got more of what I think is a solution posted here: R: use rvest (or httr) to log in to a site requiring cookies. Hopefully someone will help us get the rest of the way.

Frederick Solt answered Nov 05 '22 10:11