Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Programmatically scraping a response header within R

Tags:

curl

r

httr

I am trying to access the highlighted response header: location text in the screenshot below using only R and its curl-based webscraping libraries. one can easily get to this point in any web browser by visiting http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp, clicking on the download for any of the data files, and filling out the agreement form. The download begins automatically in a web browser.

enter image description here

I believe that the only way to obtain a valid cookie is with library(curlconverter) (see How to download a file behind a semi-broken javascript asp function with R) but that answer does not appear to be enough to programmatically determine the http url of the file, only to download the zipped file once it's already known.

I've pasted some code below with different httr and curlconverter code that I've played around with, but I'm missing something here. Again, the only goal is to programmatically determine the highlighted text entirely within R (cross-platform).

library(curlconverter)
library(httr)

browserPOST <-
    "curl 'http://www.worldvaluessurvey.org/AJDownload.jsp'
    -H 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    -H 'Accept-Encoding:gzip, deflate'
    -H 'Accept-Language:en-US,en;q=0.8'
    -H 'Cache-Control:max-age=0'
    --compressed -H 'Connection:keep-alive'
    -H 'Content-Length:188'
    -H 'Content-Type:application/x-www-form-urlencoded'
    -H 'Cookie:ASPSESSIONIDCASQAACD=IBLGBFOAEHFILMMJJCFEOEMI; JSESSIONID=50DABDEDD0B2FC370C415B4BD1855260; __atuvc=13%7C45; __atuvs=58224f37d312c42400c'
    -H 'Host:www.worldvaluessurvey.org'
    -H 'Origin:http://www.worldvaluessurvey.org'
    -H 'Referer:http://www.worldvaluessurvey.org/AJDownloadLicense.jsp'
    -H 'Upgrade-Insecure-Requests:1'
    -H 'User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'"

form_data <-
    list( 
        ulthost = "WVS" ,
        CMSID = "" ,
        LITITLE = "" ,
        LINOMBRE = "fas" ,
        LIEMPRESA = "asf" ,
        LIEMAIL = "asdf" ,
        LIPROJECT = "asfd" ,
        LIUSE = "1" ,
        LIPURPOSE = "asdf" ,
        LIAGREE = "1" ,
        DOID = "3996" ,
        CndWAVE = "-1" ,
        SAID = "-1" ,
        AJArchive = "WVS Data Archive" ,
        EdFunction = "" ,
        DOP = "" 
    )   



getDATA <- (straighten(browserPOST) %>% make_req)[[1]]()

a <- VERB(verb = "POST", url = "http://www.worldvaluessurvey.org/AJDownload.jsp", 
    httr::add_headers(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
        `Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8", 
        `Cache-Control` = "max-age=0", Connection = "keep-alive", 
        `Content-Length` = "188", Host = "www.worldvaluessurvey.org", 
        Origin = "http://www.worldvaluessurvey.org", Referer = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp", 
        `Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"), 
    httr::set_cookies(`Cookie:ASPSESSIONIDCASQAACD` = "IBLGBFOAEHFILMMJJCFEOEMI", 
        JSESSIONID = "50DABDEDD0B2FC370C415B4BD1855260", `__atuvc` = "13%7C45", 
        `__atuvs` = "58224f37d312c42400c"), encode = "form",body=form_data)
like image 491
Anthony Damico Avatar asked Nov 08 '16 23:11

Anthony Damico


People also ask

Can you Webscrape with R?

In general, web scraping in R (or in any other language) boils down to the following three steps: Get the HTML for the web page that you want to scrape. Decide what part of the page you want to read and find out what HTML/CSS you need to select it. Select the HTML and analyze it in the way you need.

How do I scrape hidden data from a website?

You can use the Attribute selector to scrape these hidden tags from HTML. You can write your selector manually and then enter the “content” in the attribute name option to scrape efficiently.


1 Answers

This was a nice challenge!

The problem is not related to R language. We'll have the same result in any language if we just try to post some data to the download script. We have to deal with some kind of security “pattern” here. The site restricts users from retrieving the files urls and it asks them to fill forms with data in order to provide those links. If a browser can retrieve these links, then we can too by writing the proper HTTP calls. Thing is, we need to know exactly which calls we have to make. In order to find that, we need to see the individual calls the site does whenever someone clicks to download. Here is what I found a few calls before a successful 302 AJDownload.jsp POST call:

Http requests

We can see it clearly, if we look at the AJDocumentation.jsp source, it makes these calls by using jQuery $.get:

$.get("http://ipinfo.io?token=xxxxxxxxxxxxxx", function (response) {
    var geodatos=encodeURIComponent(response.ip+"\t"+response.country+"\t"+response.postal+"\t"+
    response.loc+"\t"+response.region+"\t"+response.city+"\t"+
    response.org);

    $.get("jdsStatJD.jsp?ID="+geodatos+
        "&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation",
        function (resp2) {
    });
}, "jsonp");

Then, a few calls below, we can see the successful POST /AJDownload.jsp with status 302 Moved Temporarily and with the wanted Location in its response headers:

Http requests

HTTP/1.1 302 Moved Temporarily
Content-Length: 0
Content-Type: text/html
Location: http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Thu, 01 Dec 2016 16:24:37 GMT

So, this is the security mechanism of this site. It uses ipinfo.io to store visitor informations about their IP, Location and even the ISP organization, just before the user is about to initiate a download by clicking on a link. The script which receives these data, is the /jdsStatJD.jsp. I haven’t used ipinfo.io, nor their API key for this service (have it hidden on my screenshots) and instead I created a dummy valid sequence of data, just to validate the request. The post form data for the “protected” files are not require at all. It is possible to download the files without posting these data.

Also, the curlconverter library is not required. All we have to do, is simple GET and POST requests by using httr library. One important part I want to point out, is that in order to prevent httr POST function from following the Location header received with 302 status at our last call, we need to use the config setting config(followlocation = FALSE) which of course will prevent it from following the Location and let us fetch the Location from the headers.

OUTPUT

My R script can be run from the command line and it can accept DOID numeric values for parameters to get the file needed. For example, if we want to get the link for the file WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18, then we have to add its DOID (which is 3724) to the end of our script when calling it using the Rscript command:

Rscript wvs_fetch_downloads.r 3724
[1] "http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip"

I have created an R function to get each file location you want by just passing the DOID:

getFileById <- function(fileId)

You can remove the command line argument parsing and use the function by passing the DOID directly:

#args <- commandArgs(TRUE)
#if(length(args) == 0) {
#   print("No file id specified. Use './script.r ####'.")
#   quit("no")
#}

#fileId <- args[1]
fileId <- "3724"

# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18

getFileById(fileId)

Final R working script

library(httr)

getFileById <- function(fileId) {
    response <- GET(
        url = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1", 
        add_headers(
            `Accept` = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive", 
            `Host` = "www.worldvaluessurvey.org", 
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp", 
            `Upgrade-Insecure-Requests` = "1"))

    set_cookie <- headers(response)$`set-cookie`
    cookies <- strsplit(set_cookie, ';')
    cookie <- cookies[[1]][1]

    response <- GET(
        url = "http://www.worldvaluessurvey.org/jdsStatJD.jsp?ID=2.72.48.149%09IT%09undefined%0941.8902%2C12.4923%09Lazio%09Roma%09Orange%20SA%20Telecommunications%20Corporation&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation", 
        add_headers(
            `Accept` = "*/*", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive", 
            `X-Requested-With` = "XMLHttpRequest",
            `Host` = "www.worldvaluessurvey.org", 
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
            `Cookie` = cookie))

    post_data <- list( 
        ulthost = "WVS",
        CMSID = "",
        CndWAVE = "-1",
        SAID = "-1",
        DOID = fileId,
        AJArchive = "WVS Data Archive",
        EdFunction = "",
        DOP = "",
        PUB = "")  

    response <- POST(
        url = "http://www.worldvaluessurvey.org/AJDownload.jsp", 
        config(followlocation = FALSE),
        add_headers(
            `Accept` = "*/*", 
            `Accept-Encoding` = "gzip, deflate",
            `Accept-Language` = "en-US,en;q=0.8", 
            `Cache-Control` = "max-age=0",
            `Connection` = "keep-alive",
            `Host` = "www.worldvaluessurvey.org",
            `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
            `Content-type` = "application/x-www-form-urlencoded",
            `Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
            `Cookie` = cookie),
        body = post_data,
        encode = "form")

    location <- headers(response)$location
    location
}

args <- commandArgs(TRUE)
if(length(args) == 0) {
    print("No file id specified. Use './script.r ####'.")
    quit("no")
}

fileId <- args[1]

# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18

getFileById(fileId)
like image 107
Christos Lytras Avatar answered Oct 16 '22 22:10

Christos Lytras