
Follow a page redirect using rvest in R

I am new to R and rvest. I am trying to use them to get information from a website (www.medicinescomplete.com) that allows sign-in through the Athens academic login system. In a browser, clicking the Athens login button takes you to an Athens login form. After you submit your credentials, the form redirects the browser back to the original site, now logged in.

I used submit_form() to submit the credentials to the Athens form, and this returns a 200 status code. However, R does not follow the redirect as a browser would, and if I use jump_to() to return to the original site, it is not logged in. I suspect that the redirect link returned by the sign-in page contains the login credentials I need, but I do not know how to find the link and send it using rvest.

Has anyone worked out how to log in via Athens using rvest, or does anyone have an idea how to make it follow an automatic redirect?

The code I have used to get this far is (login credentials changed):

library(rvest)
library(magrittr)

url <- "https://www.medicinescomplete.com/about/"
mcsession <- html_session(url)
mcsession <- jump_to(mcsession, "/mc/athens.htm?uri=https%3A%2F%2Fwww.medicinescomplete.com%2Fabout%2F")
athensform <- html_form(mcsession)[[1]]
athensform <- set_values(athensform, ath_uname = "xxx", ath_passwd = "yyy")
submit_form(mcsession, athensform)
jump_to(mcsession, "https://www.medicinescomplete.com/mc/bnf/current/")

I get a 200 code for the submit_form() step but a 403 Forbidden code for the final jump_to() line.

I then piped the submit_form() step into html() and printed it. From what I could make out, the login was successful, but the body of the page contains a line about redirecting back to the original site. The HTML for the whole page is too long to post, but the relevant bit seems to be:

<div style="padding: 8px;" id="logindiv">
                        <form method="POST" action="https://www.medicinescomplete.com/mc/athens">
                            Please wait while we transfer you. <br><noscript>JavaScript disabled, please<input type="submit" value="click here" style="border:none;background:none;text-decoration:underline;color:#E27B2F;">

And I wonder if the following bit refers to some login key:

<input type="hidden" name="TARGET" value="https://www.medicinescomplete.com/about/" style="display:none"><input type="hidden" name="RelayState" value="https://www.medicinescomplete.com/about/" style="display:none"><input type="hidden" name="SAMLResponse" value="PFJlc3BvbnNlIHhtbG5zPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6cHJvdG9jb2wiIHhtbG5zOnNhbWwyPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6YXNzZXJ0aW9uIiBEZXN...

Aha! Further down the page there is this:

<script>
window.onload = function() { document.forms[0].submit(); }
</script>

I think the window is meant to automatically submit another form that performs the POST to the original medicinescomplete.com site, authenticating with the hidden field as a login credential. However, when I try submit_form() on this page I don't seem to get any further. I added the following line to work out what is going on:

> submit_form(mcsession, athensform) %>% html_form() %>% str()

And this gives the following output:

Submitting with 'submit'
List of 1
 $ :List of 5
  ..$ name   : chr "<unnamed>"
  ..$ method : chr "POST"
  ..$ url    : chr "https://www.medicinescomplete.com/mc/athens"
  ..$ enctype: chr "form"
  ..$ fields :List of 4
  .. ..$ NULL        :List of 7
  .. .. ..$ name    : NULL
  .. .. ..$ type    : chr "submit"
  .. .. ..$ value   : chr "click here"
  .. .. ..$ checked : NULL
  .. .. ..$ disabled: NULL
  .. .. ..$ readonly: NULL
  .. .. ..$ required: logi FALSE
  .. .. ..- attr(*, "class")= chr "input"
  .. ..$ TARGET      :List of 7
  .. .. ..$ name    : chr "TARGET"
  .. .. ..$ type    : chr "hidden"
  .. .. ..$ value   : chr "https://www.medicinescomplete.com/about/"
  .. .. ..$ checked : NULL
  .. .. ..$ disabled: NULL
  .. .. ..$ readonly: NULL
  .. .. ..$ required: logi FALSE
  .. .. ..- attr(*, "class")= chr "input"
  .. ..$ RelayState  :List of 7
  .. .. ..$ name    : chr "RelayState"
  .. .. ..$ type    : chr "hidden"
  .. .. ..$ value   : chr "https://www.medicinescomplete.com/about/"
  .. .. ..$ checked : NULL
  .. .. ..$ disabled: NULL
  .. .. ..$ readonly: NULL
  .. .. ..$ required: logi FALSE
  .. .. ..- attr(*, "class")= chr "input"
  .. ..$ SAMLResponse:List of 7
  .. .. ..$ name    : chr "SAMLResponse"
  .. .. ..$ type    : chr "hidden"
  .. .. ..$ value   : chr "PFJlc3BvbnNlIHhtbG5zPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA6cHJvdG9jb2wiIHhtbG5zOnNhbWwyPSJ1cm46b2FzaXM6bmFtZXM6dGM6U0FNTDoyLjA"| __truncated__
  .. .. ..$ checked : NULL
  .. .. ..$ disabled: NULL
  .. .. ..$ readonly: NULL
  .. .. ..$ required: logi FALSE
  .. .. ..- attr(*, "class")= chr "input"
  .. ..- attr(*, "class")= chr "fields"
  ..- attr(*, "class")= chr "form"
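
The hidden fields in this output are what the page's JavaScript would post to the form's URL. A minimal sketch of pulling them out by hand is shown here, assuming a shortened stand-in for the returned page (the `PLACEHOLDER` value replaces the truncated SAMLResponse). The names `relay_html`, `extract_relay`, and `post_relay` are hypothetical, and whether replaying the POST this way is enough to log in has not been verified against the real site.

```r
library(xml2)
library(rvest)

# Shortened stand-in for the page returned after the Athens login.
# The SAMLResponse value is a placeholder, not a real assertion.
relay_html <- '
<form method="POST" action="https://www.medicinescomplete.com/mc/athens">
  <input type="submit" value="click here">
  <input type="hidden" name="TARGET" value="https://www.medicinescomplete.com/about/">
  <input type="hidden" name="RelayState" value="https://www.medicinescomplete.com/about/">
  <input type="hidden" name="SAMLResponse" value="PLACEHOLDER">
</form>'

# Pull the action URL and the named hidden fields out of the first form.
extract_relay <- function(page) {
  form   <- html_nodes(page, "form")[[1]]
  hidden <- html_nodes(form, "input[type='hidden']")
  fields <- as.list(html_attr(hidden, "value"))
  names(fields) <- html_attr(hidden, "name")
  list(action = html_attr(form, "action"), fields = fields)
}

relay <- extract_relay(read_html(relay_html))
str(relay)

# Hypothetical helper (not called here): replay the POST that the page's
# JavaScript would perform. Assumes the rvest session object exposes its
# httr handle as session$handle, so that cookies carry over.
post_relay <- function(session, relay) {
  httr::POST(relay$action, body = relay$fields, encode = "form",
             handle = session$handle)
}
```

Selecting only the hidden inputs leaves the unnamed "click here" submit button out of the request body.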

I feel like the information in this form should allow me to log in to the original site, but I don't quite understand how. Unfortunately, when I try submit_form() again with this form, it doesn't work. I tried this:

submit_form(mcsession, athensform) %>% html_form() %>% submit_form(mcsession, .) %>% html()

And got this:

Submitting with 'submit'
Submitting with ''
Error in if (!(submit %in% names(submits))) { : 
  argument is of length zero

iProcrastinate asked Apr 01 '15 11:04



1 Answer

It's very likely tied to an httr issue that prevents httr from issuing the correct GET query on redirect.

It is a little hard to guess, though, because the question lacks a reproducible example or the complete verbose output of the query.

A workaround is to prevent the redirect with:

rvest::submit_form(...,
                   httr::config(followlocation = FALSE))
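
To check whether an HTTP-level redirect is involved at all, one option is to request a URL with redirects disabled and look at the status code and Location header. This is only a sketch: `inspect_redirect` is a hypothetical helper, and it will not reveal a JavaScript-driven form submission like the one shown in the question.

```r
# Hypothetical helper: request a URL without following redirects and
# report the status code plus the Location header, if any.
inspect_redirect <- function(url) {
  resp <- httr::GET(url, httr::config(followlocation = FALSE))
  list(
    status   = httr::status_code(resp),
    location = httr::headers(resp)$location  # NULL when no redirect is sent
  )
}

# Example call (requires network access):
# inspect_redirect("https://www.medicinescomplete.com/about/")
```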

Antoine Lizée answered Oct 09 '22 21:10
