I am trying to access the highlighted Location response-header text in the screenshot below using only R and its curl-based web-scraping libraries. One can easily get to this point in any web browser by visiting http://www.worldvaluessurvey.org/WVSDocumentationWVL.jsp, clicking the download link for any of the data files, and filling out the agreement form; the download then begins automatically in the browser.
I believe that the only way to obtain a valid cookie is with library(curlconverter) (see How to download a file behind a semi-broken javascript asp function with R), but that answer does not appear to be enough to programmatically determine the HTTP URL of the file, only to download the zipped file once the URL is already known.
I've pasted below some httr and curlconverter code that I've played around with, but I'm missing something here. Again, the only goal is to programmatically determine the highlighted text entirely within R (cross-platform).
library(curlconverter)
library(httr)
browserPOST <-
"curl 'http://www.worldvaluessurvey.org/AJDownload.jsp'
-H 'Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
-H 'Accept-Encoding:gzip, deflate'
-H 'Accept-Language:en-US,en;q=0.8'
-H 'Cache-Control:max-age=0'
--compressed -H 'Connection:keep-alive'
-H 'Content-Length:188'
-H 'Content-Type:application/x-www-form-urlencoded'
-H 'Cookie:ASPSESSIONIDCASQAACD=IBLGBFOAEHFILMMJJCFEOEMI; JSESSIONID=50DABDEDD0B2FC370C415B4BD1855260; __atuvc=13%7C45; __atuvs=58224f37d312c42400c'
-H 'Host:www.worldvaluessurvey.org'
-H 'Origin:http://www.worldvaluessurvey.org'
-H 'Referer:http://www.worldvaluessurvey.org/AJDownloadLicense.jsp'
-H 'Upgrade-Insecure-Requests:1'
-H 'User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'"
form_data <-
list(
ulthost = "WVS" ,
CMSID = "" ,
LITITLE = "" ,
LINOMBRE = "fas" ,
LIEMPRESA = "asf" ,
LIEMAIL = "asdf" ,
LIPROJECT = "asfd" ,
LIUSE = "1" ,
LIPURPOSE = "asdf" ,
LIAGREE = "1" ,
DOID = "3996" ,
CndWAVE = "-1" ,
SAID = "-1" ,
AJArchive = "WVS Data Archive" ,
EdFunction = "" ,
DOP = ""
)
getDATA <- (straighten(browserPOST) %>% make_req)[[1]]()
a <- VERB(verb = "POST", url = "http://www.worldvaluessurvey.org/AJDownload.jsp",
httr::add_headers(Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
`Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8",
`Cache-Control` = "max-age=0", Connection = "keep-alive",
`Content-Length` = "188", Host = "www.worldvaluessurvey.org",
Origin = "http://www.worldvaluessurvey.org", Referer = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp",
`Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"),
httr::set_cookies(ASPSESSIONIDCASQAACD = "IBLGBFOAEHFILMMJJCFEOEMI",
JSESSIONID = "50DABDEDD0B2FC370C415B4BD1855260", `__atuvc` = "13%7C45",
`__atuvs` = "58224f37d312c42400c"), encode = "form",body=form_data)
This was a nice challenge!
The problem is not specific to R; we would get the same result in any language if we simply tried to POST some data to the download script. We have to deal with a kind of security "pattern" here: the site keeps users from retrieving the file URLs directly and asks them to fill out a form before providing those links. If a browser can retrieve these links, then we can too, by making the proper HTTP calls. The thing is, we need to know exactly which calls to make, and to find that out we have to watch the individual requests the site sends whenever someone clicks to download. Here is what I found a few requests before the successful 302 AJDownload.jsp POST call.
We can see it clearly in the AJDocumentation.jsp source, which makes these calls using jQuery's $.get:
$.get("http://ipinfo.io?token=xxxxxxxxxxxxxx", function (response) {
var geodatos=encodeURIComponent(response.ip+"\t"+response.country+"\t"+response.postal+"\t"+
response.loc+"\t"+response.region+"\t"+response.city+"\t"+
response.org);
$.get("jdsStatJD.jsp?ID="+geodatos+
"&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation",
function (resp2) {
});
}, "jsonp");
Then, a few calls later, we can see the successful POST to /AJDownload.jsp with status 302 Moved Temporarily and the wanted Location in its response headers:
HTTP/1.1 302 Moved Temporarily
Content-Length: 0
Content-Type: text/html
Location: http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Thu, 01 Dec 2016 16:24:37 GMT
So, this is the site's security mechanism: it uses ipinfo.io to record visitor information (IP address, location, even the ISP organization) just before the user initiates a download by clicking a link. The script that receives this data is /jdsStatJD.jsp. I did not use ipinfo.io or their API key for this service (it is hidden in my screenshots); instead I created a dummy but valid-looking sequence of data, just to validate the request. The agreement-form data for the "protected" files is not required at all; it is possible to download the files without posting it.
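For illustration, here is a rough httr sketch of that tracking call; the geodata values are made up, and the cookie variable is assumed to hold the session cookie obtained from an earlier GET of AJDocumentation.jsp (as in the full script below):

# Sketch only: a dummy tab-separated geodata value (IP, country, postal,
# lat/lon, region, city, org), URL-encoded and sent to the tracking script.
# `cookie` is an assumed variable holding the session cookie.
geodata <- URLencode(paste("2.72.48.149", "IT", "undefined", "41.8902,12.4923",
                           "Lazio", "Roma", "Some ISP", sep = "\t"),
                     reserved = TRUE)
tracking <- GET(
  paste0("http://www.worldvaluessurvey.org/jdsStatJD.jsp?ID=", geodata,
         "&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp",
         "&referer=null&cms=Documentation"),
  add_headers(`Cookie` = cookie))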
Also, the curlconverter library is not required; all we need are simple GET and POST requests with the httr library. One important part to point out: to keep httr's POST function from following the Location header returned with the 302 status on our last call, we have to use the config setting config(followlocation = FALSE), which prevents it from following the redirect and lets us fetch the Location from the response headers.
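As a minimal sketch (assuming post_data and cookie have already been prepared, as in the full script below), that final call looks like this:

response <- POST(
  url = "http://www.worldvaluessurvey.org/AJDownload.jsp",
  config(followlocation = FALSE),  # do not follow the 302 redirect
  add_headers(`Cookie` = cookie),
  body = post_data,
  encode = "form")
status_code(response)        # expect 302
headers(response)$location   # the direct .zip download URL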
My R script can be run from the command line and accepts a numeric DOID value as a parameter to pick the file needed. For example, if we want the link for the file WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18, we add its DOID (which is 3724) to the end of the call when running the script with the Rscript command:

OUTPUT
Rscript wvs_fetch_downloads.r 3724
[1] "http://www.worldvaluessurvey.org/wvsdc/CO00001/F00003724-WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18.zip"
I have created an R function that returns the location of any file you want by just passing its DOID:

getFileById <- function(fileId)

You can remove the command-line argument parsing and use the function by passing the DOID directly:
#args <- commandArgs(TRUE)
#if(length(args) == 0) {
# print("No file id specified. Use './script.r ####'.")
# quit("no")
#}
#fileId <- args[1]
fileId <- "3724"
# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18
getFileById(fileId)
Final working R script
library(httr)
getFileById <- function(fileId) {
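# Step 1: GET the documentation page to obtain a fresh session cookie.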
response <- GET(
url = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
add_headers(
`Accept` = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
`Accept-Encoding` = "gzip, deflate",
`Accept-Language` = "en-US,en;q=0.8",
`Cache-Control` = "max-age=0",
`Connection` = "keep-alive",
`Host` = "www.worldvaluessurvey.org",
`User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
`Content-type` = "application/x-www-form-urlencoded",
`Referer` = "http://www.worldvaluessurvey.org/AJDownloadLicense.jsp",
`Upgrade-Insecure-Requests` = "1"))
set_cookie <- headers(response)$`set-cookie`
cookies <- strsplit(set_cookie, ';')
cookie <- cookies[[1]][1]
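# Step 2: replay the jdsStatJD.jsp tracking call with a dummy geodata string to validate the session.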
response <- GET(
url = "http://www.worldvaluessurvey.org/jdsStatJD.jsp?ID=2.72.48.149%09IT%09undefined%0941.8902%2C12.4923%09Lazio%09Roma%09Orange%20SA%20Telecommunications%20Corporation&url=http%3A%2F%2Fwww.worldvaluessurvey.org%2FAJDocumentation.jsp&referer=null&cms=Documentation",
add_headers(
`Accept` = "*/*",
`Accept-Encoding` = "gzip, deflate",
`Accept-Language` = "en-US,en;q=0.8",
`Cache-Control` = "max-age=0",
`Connection` = "keep-alive",
`X-Requested-With` = "XMLHttpRequest",
`Host` = "www.worldvaluessurvey.org",
`User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
`Content-type` = "application/x-www-form-urlencoded",
`Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
`Cookie` = cookie))
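# Form body for the download request; only the file id (DOID) and archive fields are sent, not the agreement-form fields.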
post_data <- list(
ulthost = "WVS",
CMSID = "",
CndWAVE = "-1",
SAID = "-1",
DOID = fileId,
AJArchive = "WVS Data Archive",
EdFunction = "",
DOP = "",
PUB = "")
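# Step 3: POST the download request with followlocation = FALSE so the 302 is not followed and its Location header stays readable.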
response <- POST(
url = "http://www.worldvaluessurvey.org/AJDownload.jsp",
config(followlocation = FALSE),
add_headers(
`Accept` = "*/*",
`Accept-Encoding` = "gzip, deflate",
`Accept-Language` = "en-US,en;q=0.8",
`Cache-Control` = "max-age=0",
`Connection` = "keep-alive",
`Host` = "www.worldvaluessurvey.org",
`User-Agent` = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
`Content-type` = "application/x-www-form-urlencoded",
`Referer` = "http://www.worldvaluessurvey.org/AJDocumentation.jsp?CndWAVE=-1",
`Cookie` = cookie),
body = post_data,
encode = "form")
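# The Location header of the 302 response holds the direct download URL.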
location <- headers(response)$location
location
}
args <- commandArgs(TRUE)
if(length(args) == 0) {
print("No file id specified. Use './script.r ####'.")
quit("no")
}
fileId <- args[1]
# DOID=3843 : WVS_EVS_Integrated_Dictionary_Codebook v_2014_09_22 (Excel)
# DOID=3844 : WVS_Values Surveys Integrated Dictionary_TimeSeries_v_2014-04-25 (Excel)
# DOID=3725 : WVS_Longitudinal_1981-2014_rdata_v_2015_04_18
# DOID=3996 : WVS_Longitudinal_1981-2014_sas_v_2015_04_18
# DOID=3723 : WVS_Longitudinal_1981-2014_spss_v_2015_04_18
# DOID=3724 : WVS_Longitudinal_1981-2014_stata_dta_v_2015_04_18
getFileById(fileId)