
How to login and then download a file from aspx web pages with R

I'm trying to automate the download of the Panel Study of Income Dynamics files available on this web page using R. Clicking on any of those files takes the user to this login/authentication page. After authenticating, it's easy to download the files with a web browser. Unfortunately, the httr code below does not appear to maintain the authentication. I have tried inspecting the headers in Chrome for the Login.aspx page (as described here), but the session still isn't kept, even when I believe I'm passing in all the correct values.

I don't care whether it's done with httr, RCurl, or something else; I'd just like something that works inside R, so that users of this script don't have to download the files manually or with a separate program. One of my attempts is below, but it doesn't work. Any help would be appreciated. Thanks!! :D

require(httr)

values <- 
    list( 
        "ctl00$ContentPlaceHolder3$Login1$UserName" = "[email protected]" , 
        "ctl00$ContentPlaceHolder3$Login1$Password" = "somepassword" ,
        "ctl00$ContentPlaceHolder3$Login1$LoginButton" = "Log In" ,
        "_LASTFOCUS" = "" ,
        "_EVENTTARGET" = "" ,
        "_EVENTARGUMENT" = "" 
    )

POST( "http://simba.isr.umich.edu/u/Login.aspx?redir=http%3a%2f%2fsimba.isr.umich.edu%2fZips%2fZipMain.aspx" , body = values )

resp <- GET( "http://simba.isr.umich.edu/Zips/GetFile.aspx" , query = list( file = "1053" ) )
Anthony Damico asked Apr 06 '13 16:04



1 Answer

Besides storing the cookie after authentication (as noted in my comment on the question), there was another problem with your attempt: ASP.NET pages carry a __VIEWSTATE key-value pair in a hidden form field, and it has to be preserved in the requests you send back. If you check, you could not even log in with your example; the result of the POST command contains information about how to log in.
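ASP.NET emits __VIEWSTATE as a hidden input in the page's form, so it can be read from the login page's HTML before posting. Here is a minimal sketch of a reusable helper for that, in base R. The function name `extract_hidden_field` is made up for illustration, and the pattern assumes the `name` attribute comes before `value` inside the tag, which may not hold for every page:

```r
# Hypothetical helper (not part of RCurl or httr): return the value of a
# hidden <input> with the given name, or NA if the field is not found.
# Assumes `html` is a single string and that name="..." precedes value="...".
extract_hidden_field <- function(html, name) {
  pattern <- sprintf('<input[^>]*name="%s"[^>]*value="([^"]*)"', name)
  m <- regmatches(html, regexec(pattern, html))[[1]]
  # m[1] is the whole match, m[2] the captured value
  if (length(m) < 2) NA_character_ else m[2]
}

# Example on a hand-written snippet (not the real login page):
sample_html <- '<form><input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4/abc+==" /></form>'
viewstate <- extract_hidden_field(sample_html, "__VIEWSTATE")
```

The same helper could be called for other hidden fields, if the form has any; a missing field comes back as NA instead of silently returning the whole page.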

An outline of a possible solution:

  1. Load RCurl package:

    > library(RCurl)
    
  2. Set some handy curl options:

    > curl = getCurlHandle()
    > curlSetOpt(cookiejar = 'cookies.txt', followlocation = TRUE, autoreferer = TRUE, curl = curl)
    
  3. Load the page for the first time to capture VIEWSTATE:

    > html <- getURL('http://simba.isr.umich.edu/u/Login.aspx', curl = curl)
    
  4. Extract VIEWSTATE with a regular expression or any other tool:

    > viewstate <- as.character(sub('.*id="__VIEWSTATE" value="([0-9a-zA-Z+/=]*).*', '\\1', html))
    
  5. Set the parameters as your username, password and the VIEWSTATE:

    > params <- list(
        'ctl00$ContentPlaceHolder3$Login1$UserName'    = '<USERNAME>',
        'ctl00$ContentPlaceHolder3$Login1$Password'    = '<PASSWORD>',
        'ctl00$ContentPlaceHolder3$Login1$LoginButton' = 'Log In',
        '__VIEWSTATE'                                  = viewstate
        )
    
  6. Finally, log in:

    > html = postForm('http://simba.isr.umich.edu/u/Login.aspx', .params = params, curl = curl)
    

    Congrats, now you are logged in and curl holds the cookie verifying that!

  7. Verify that you are logged in:

    > grepl('Logout', html)
    [1] TRUE
    
  8. So you can go ahead and download any file - just be sure to pass curl = curl in all your queries.
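
Step 8 leaves the download itself unwritten. Here is a minimal sketch of one way to do it, assuming `curl` is the authenticated handle from the steps above and using the GetFile.aspx URL and `file` parameter from the question. The helper names and the output file name are made up for illustration, and the code has not been run against the live site:

```r
# Build the download URL for a given file id (endpoint taken from the question).
psid_file_url <- function(file_id) {
  paste0("http://simba.isr.umich.edu/Zips/GetFile.aspx?file=", file_id)
}

# Fetch one file as raw bytes through the logged-in curl handle and
# write it to disk. Requires the RCurl package to be installed.
download_psid_file <- function(file_id, destfile, curl) {
  bytes <- RCurl::getBinaryURL(psid_file_url(file_id), curl = curl)
  writeBin(bytes, destfile)
  invisible(destfile)
}

# Usage, once logged in (the file name is a placeholder):
# download_psid_file("1053", "psid_1053.zip", curl = curl)
```

getBinaryURL is used rather than getURL because the response is presumably a binary archive, not text. If the saved file turns out to be an HTML page, the login most likely did not succeed.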

daroczig answered Oct 02 '22 13:10
