 

In R, use rvest and xml2 to extract JSON object from a <script> element on website

I previously posted a related Stack Overflow question about scraping a table on the leaderboard page of the PGA's website. To summarize that post, the leaderboard table is difficult to scrape, apparently because the page uses JavaScript to render the table.

When I inspect the page, I see in a <script> tag that there is an object global.leaderboardConfig with useful info.

Is it possible to get this object as a list in R? I am able to grab all 76 script elements on the page using xml2::read_html('https://www.pgatour.com/leaderboard.html') %>% html_nodes('script'). However, I'm not sure how to identify the specific script tag needed, nor how to get an object out of it.

Edit: In the Network tab of devtools, there is also a request that provides the link for an API call that gets the data. Rather than fetching the object from the script tag, perhaps it is easier to grab all network requests and sift through those instead?

Canovice asked Apr 16 '21 00:04



1 Answer

This site generates the hmac and expire URL parameter values with a JS function that uses a specific algorithm. The arguments of this algorithm depend on the epoch time, which is passed as a URL parameter to the JS file hosting that function. The hmac value is therefore different each time, because it is computed by a file whose URL changes constantly.

This algorithm consists of bitwise AND and XOR operations like this (pseudocode):

step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step

step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step

....
....

The xorKey numbers are generated dynamically on https://microservice.pgatour.com/js based on the epoch time. You just need to request this JS file with the current epoch time as a URL parameter and use a regex to extract all the xorKey values required by the above algorithm (they start with -1). You will also need to reproduce the algorithm above in R.
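To make the pseudocode concrete, here is a minimal offline sketch in JavaScript. The helper name buildTrackingValue is made up for illustration. The mask, the fixed value and the seed are the ones used in the scripts in this answer, and the two sample keys are taken from one snapshot of the obfuscated script quoted further down, so they will not match what the server returns today.

```javascript
// Constants as used in the scripts in this answer.
const MASK = 4294967295;       // in JS, "& MASK" coerces to a signed 32-bit integer
const ENCODED_ID = 1798339286; // fixed value subtracted at every step
const SEED = 101;              // char code of "e", the first output character

// Hypothetical helper: applies one step of the pseudocode per XOR key.
function buildTrackingValue(xorKeys, encodedId = ENCODED_ID, seed = SEED) {
  let value = seed;
  let result = String.fromCharCode(value);
  for (const key of xorKeys) {
    const step = ((value * value - encodedId) & MASK) ^ key;
    result += String.fromCharCode(step);
    value = step; // the next step depends on the previous character
  }
  return result;
}

// First two keys from one snapshot of the script (they change with the epoch time).
const sampleKeys = [-1798328965, -1798324966];
console.log(buildTrackingValue(sampleKeys)); // prints "exp"
```

Each key yields exactly one character, so the full list of keys spells out the whole parameter value.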

The following script generates the url parameters and makes the API call:

library(httr)
library(stringr)
library(bitops)

# fixed values
init <- 4294967295
value <- 101
encodedId <- 1798339286
result <- rawToChar(as.raw(value))

# epoch time is dynamic
time <- as.numeric(as.POSIXct(Sys.time()))*1000

output <- content(GET("https://microservice.pgatour.com/js", query = list(
    "_" = format(time, digits=13)
  )), as = "text", encoding = "UTF-8")

steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl=TRUE)) #extract steps
stepsNum <- as.numeric(unlist(steps)) #convert to num

# apply one step per extracted value; each step yields one character of the result
for (t in stepsNum) {
    step <- bitXor(bitAnd(value * value - encodedId, init), t)
    result <- paste0(result, rawToChar(as.raw(step)))
    value <- step
}
print(result)

# extract leaderboard config url
output <- content(GET("https://www.pgatour.com/leaderboard.html"), as = "text", encoding = "UTF-8")
# capture the quoted value after "leaderboardUrl:" and unescape "\/" to "/"
configUrl <- gsub("\\\\/", "/", str_match(output, "leaderboardUrl:\\s*'([^']*)'")[2])

# append the generated value as the userTrackingId parameter
url <- paste0(configUrl, "?userTrackingId=", result)
data <- content(GET(url), as = "parsed", type = "application/json")

print(data)

kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract
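The config extraction step in the script above can be tried offline. Below is a rough JavaScript sketch (the language of the Node.js scripts used later in this answer). The page fragment and the example.com URLs are invented for illustration, and extractLeaderboardUrl is a hypothetical helper name; only the leaderboardUrl key and the escaped slashes mirror what the R script expects.

```javascript
// Hypothetical page fragment; the real page and URLs will differ.
const sampleHtml = String.raw`<script>
  global.leaderboardConfig = {
    leaderboardUrl: 'https:\/\/example.com\/data\/leaderboard.json',
    translationsUrl: 'https:\/\/example.com\/i18n.json'
  };
</script>`;

// Hypothetical helper: find the quoted value after "leaderboardUrl:".
function extractLeaderboardUrl(html) {
  const match = html.match(/leaderboardUrl:\s*'([^']*)'/);
  if (!match) return null;
  // The value is embedded with escaped slashes ("\/"), so unescape them.
  return match[1].replace(/\\\//g, "/");
}

console.log(extractLeaderboardUrl(sampleHtml));
// prints https://example.com/data/leaderboard.json
```

This only pulls one string out of the script text. It does not parse the whole config object, which is not valid JSON in the form shown here (single quotes, unquoted keys).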

How to find this algorithm?

I searched through the JavaScript code and reversed the obfuscated parts into something understandable. It is quite a long way to go, so let's take it step by step.

Mission n°1 - search for leaderboardUrl

You've given the first hint in your question, the location of the config where there is a leaderboardUrl.

There is a JS file named stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js that has occurrences of leaderboardUrl in config.leaderboardUrl:

{
    key: "getLeaderboardData",
    value: function (t, r, n) {
      var o = this,
        e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()),  <===================== HERE
        a = [this.performFetch(e)].concat(
          g(
            "initial" === n && this.config.translationsUrl
              ? [y.default.load(this.config.translationsUrl)]
              : []
          )
        ),
        ..........
}

Let's look at the performFetch function, which seems to send the request:

{
    key: "performFetch",
    value: function (t) {
      var r = this,
        e =
          1 < arguments.length && void 0 !== arguments[1]
            ? arguments[1]
            : {};
      return t
        ? ((0, a.isProtectedUrl)(t) &&
            (t = this.getUrlWithAuth(t)), <===================== HERE
          (0, o.default)(t, e)
            .then(function (e) {
              return r.checkFetchResponseStatus(e, t);
    .................

We've spotted the getUrlWithAuth function:

  {
    key: "getUrlWithAuth",
    value: function (e) {
      var t = u.setTrackingUserId, 
        r = u.UserIdTracker, 
        n = r && r.getTrackingUserIdParam && r.getUserId; <===================== HERE
      if (t && n) {
        var o = r.getTrackingUserIdParam(), <===================== HERE
          a = t(r.getUserId());
        return u.setUrlParameter(e, o, a);
      }
      return e;
    },
  },

Now we have getUserId and getTrackingUserIdParam, which look like the functions that supply the authorization parameter added to the URL. The problem is that we still have to find where these functions are defined.
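To see the flow of getUrlWithAuth without the minified names, here is a readable sketch. Everything in it is a stand-in: setUrlParameter is a guess at what u.setUrlParameter does, the tracker object is a stub, and the value returned by the stubbed setTrackingUserId is a placeholder. The parameter name userTrackingId is the one used in the R script above.

```javascript
// Hypothetical stand-in for u.setUrlParameter: append name=value to the URL.
function setUrlParameter(url, name, value) {
  const separator = url.includes("?") ? "&" : "?";
  return `${url}${separator}${name}=${value}`;
}

// Readable version of the control flow in getUrlWithAuth.
function getUrlWithAuth(url, tracker, setTrackingUserId) {
  const trackerReady =
    tracker && tracker.getTrackingUserIdParam && tracker.getUserId;
  if (!setTrackingUserId || !trackerReady) return url; // nothing to add
  const paramName = tracker.getTrackingUserIdParam();
  const paramValue = setTrackingUserId(tracker.getUserId());
  return setUrlParameter(url, paramName, paramValue);
}

// Stubs for illustration only.
const stubTracker = {
  getTrackingUserIdParam: () => "userTrackingId",
  getUserId: () => "id8730931",
};
const stubSetTrackingUserId = (id) => `value-for-${id}`;

console.log(
  getUrlWithAuth("https://example.com/data.json", stubTracker, stubSetTrackingUserId)
);
// prints https://example.com/data.json?userTrackingId=value-for-id8730931
```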

Mission n°2 - Deobfuscation challenge: substitutions

I've spotted a file named main.c03ddfd249437fcce43410c35a21c6f8.js that contains occurrences of getUserId and getTrackingUserIdParam:

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
    return t[g -= 398]
},
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
.................
function(g, e) {
    var t = A
      , C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
      , I = t(428) + t(423) + t(407)
      , o = t(483) + "rTr" + t(446) + t(477) + "Id";
    C[t(489) + t(463) + t(469) + "cker"] = {
        ........................
        getTrackingUserIdParam: function() {
            return o
        },
        getUserId: function() {
            return I
        },
        ......................
    }
}(jQuery, window)
},

I've skipped a lot of code in the above snippet to make it clearer.

You can see that string substitutions are used here. The t array is the base, the A function looks up strings in it by offset index, and an init function rotates the initial t array so that the indexes decode to the right strings.
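The rotation trick is easier to follow on a toy example. The sketch below uses a four-entry table, an offset of 100 and a target checksum of 73, all invented for illustration. It mirrors the idea of the init function (rotate until a parseInt-based checksum matches) but leaves out the try/catch of the real code.

```javascript
// Toy data: a tiny shuffled string table, not the site's real one.
const table = ["Id", "3xyz", "get", "7abc"];
const OFFSET = 100; // hypothetical index offset, like 398 in the real script

// Same role as the A function: look up a string by offset index.
const lookup = (n) => table[n - OFFSET];

// Rotate the table until the checksum built from parseInt() matches the target.
function rotateUntilMatch(tbl, checksum, target) {
  for (let rotations = 0; rotations < tbl.length; rotations++) {
    // parseInt("Id") is NaN, so a wrong order simply fails the comparison
    if (checksum() === target) return rotations;
    tbl.push(tbl.shift()); // move the first entry to the end
  }
  throw new Error("no rotation produced the target checksum");
}

const rotations = rotateUntilMatch(
  table,
  () => parseInt(lookup(100), 10) * 10 + parseInt(lookup(102), 10),
  73
);
console.log(rotations, lookup(103) + lookup(101)); // prints: 3 getId
```

Entries such as "3xyz" play the role of strings like "270981fHMpFv" in the real table: parseInt reads their leading digits, so only one ordering of the table satisfies the checksum.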

You can paste the obfuscated snippet from main.c03ddfd249437fcce43410c35a21c6f8.js into a Node.js script, modify it a little, and then use something like:

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];

var A = function(g, e) {
    return t[g -= 398]
};
console.log(t);
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);

// prints e[pgatour] = e[pgatour] || {};

Here e is window, so you "just" have to substitute all the A(XXX) calls to better understand what is going on.

You would spot this:

onBeforeSendRequest: function(g, e) {
    var A = t;
    if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
        var I = this["getUse" + A(463)]()
            , o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
            , n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
        e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
    }
},

which when decoded gives something like:

onBeforeSendRequest: function(g, e) {
    if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
        var I = this["getUserId"]()
            , o = window["pgatour"]["setTrackingUserId"](I)
            , n = this["getTrackingUserIdParam"]();
        e.url = C["setUrlParameter"](e["url"], n, o)
    }
},
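Substituting every A(XXX) by hand is tedious, so the decoding step above can be roughly automated. The sketch below is an assumption about how one might do it, not the script I actually used: the table is a shortened, already-rotated stand-in, and the folding of adjacent string literals is a naive regex that only handles simple double-quoted strings.

```javascript
// Shortened stand-in for the rotated string table (illustration only).
const decodedTable = ["pga", "tou", "Tra", "cki", "ngU", "ser"];
const TABLE_OFFSET = 398;
const decode = (n) => decodedTable[n - TABLE_OFFSET];

// Replace each A(NNN) call with its string, then merge "a" + "b" into "ab".
function substituteCalls(source, decodeFn) {
  let out = source.replace(/\bA\((\d+)\)/g, (_, n) =>
    JSON.stringify(decodeFn(Number(n)))
  );
  let previous;
  do {
    previous = out;
    // Naive fold: only safe for plain double-quoted literals without escapes.
    out = out.replace(/"([^"\\]*)"\s*\+\s*"([^"\\]*)"/g, '"$1$2"');
  } while (out !== previous);
  return out;
}

const obfuscated =
  'window[A(398) + A(399) + "r"]["set" + A(400) + A(401) + A(402) + A(403) + "Id"]';
console.log(substituteCalls(obfuscated, decode));
// prints window["pgatour"]["setTrackingUserId"]
```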

The function we are looking for is window["pgatour"]["setTrackingUserId"]. But we could have known this since mission n°1. Remember in the first JS file:

var t = u.setTrackingUserId

where u is window.pgatour.

But here the input parameter I is hard-coded:

var I = A(428) + A(423) + A(407);

which is equivalent to var I = "id8730931"

Now let's look at the window["pgatour"]["setTrackingUserId"] function.

Mission n°3 - Crypto/reverse

Open the Chrome developer console on the website and evaluate window["pgatour"]["setTrackingUserId"]. You will get something like this:

function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................

Yes :( again more obfuscated code to deal with

By looking at the scripts loaded by the application, you can find the file where it's located. This is the JS file URL:

https://microservice.pgatour.com/js?_=1618868625306

There is a URL parameter specifying an epoch time, and the code changes depending on this parameter.
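For reference, building that URL for the current time could look like the sketch below. The helper name buildScriptUrl is made up; the endpoint and the "_" parameter name are the ones visible in the URL above, and the value is the epoch time in milliseconds.

```javascript
const SCRIPT_ENDPOINT = "https://microservice.pgatour.com/js";

// Hypothetical helper: attach the epoch time (milliseconds) as the "_" parameter.
function buildScriptUrl(epochMs = Date.now()) {
  const url = new URL(SCRIPT_ENDPOINT);
  url.searchParams.set("_", String(epochMs));
  return url.toString();
}

console.log(buildScriptUrl(1618868625306));
// prints https://microservice.pgatour.com/js?_=1618868625306
console.log(buildScriptUrl()); // URL for the current time
```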

Looking at the code itself, we get something like this after substituting the input parameters, which are String.fromCharCode and Math.abs:

((function($__$, _, $_$) { 
    var $$_ = 4294967295; <===================== doesn't change when the epoch time is updated
    function _$__($) {
        var $$__ = 42;
        for (var _ = 0; _ < $.length; _++) {
            $$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
        }
        return Math.abs($$__);
    }
    ......
    _$_ = (__ * __ - $$) & $$_ ^ -30086, <===================== doesn't change when the epoch time is updated
    ___ += _(_$_),
    __ = _$_,
    _$_ = (__ * __ - $$) & $$_ ^ -33221,
    ___ += _(_$_),
    .....
    $__$[__$_] = (function(_$_$ = "id8730931") { <===================== this is window["pgatour"]["setTrackingUserId"] function / input is id8730931
        var $$ = _$__(_$_$);
        var _$_, ___, __;
        var __ = (__ = 101,
        ___ = String.fromCharCode(__),
        _$_ = (__ * __ - $$) & $$_ ^ -1798328965, <===================== this changes when the epoch time is updated
        ___ += String.fromCharCode(_$_),
        __ = _$_,
        _$_ = (__ * __ - $$) & $$_ ^ -1798324966,
        ___ += String.fromCharCode(_$_),
        __ = _$_,
        ....
        __ = _$_,
        ___);
        return __
    }
    );
}
)((window.pgatour || (window.pgatour = {})), String.fromCharCode, Math.abs));

We can make a nodejs script to reproduce this algorithm in a simpler way by extracting the step value (in the xor stage):

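const axios = require("axios");

const init = 4294967295; // bit mask; "& init" coerces to a signed 32-bit integer
let value = 101; // char code of "e", the first character of the result
const encodedId = 1798339286; // hash of the hard-coded id "id8730931"
let result = String.fromCharCode(value);

(async function () {
  const response = await axios.get(
    "https://microservice.pgatour.com/js?_=1618868625506"
  );
  // the XOR keys are the negative integers starting with -17 in the script
  const data = response.data.match(/-17\d+/g).map((it) => parseInt(it, 10));

  for (const t of data) {
    const step = ((value * value - encodedId) & init) ^ t;
    result += String.fromCharCode(step);
    value = step;
  }
  console.log(result);
})();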

output:

exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd

If you change the epoch time, it will give a different result.
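The output string is a set of fields separated by "~". A small sketch for splitting it is shown below. The helper name parseTrackingValue is made up, and reading exp as a Unix timestamp in seconds is my assumption based on its size, not something the site documents.

```javascript
// Hypothetical helper: split "key=value~key=value" into an object.
function parseTrackingValue(token) {
  const fields = {};
  for (const part of token.split("~")) {
    const idx = part.indexOf("=");
    fields[part.slice(0, idx)] = part.slice(idx + 1);
  }
  return fields;
}

// Output copied from the run above.
const sample =
  "exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd";
const fields = parseTrackingValue(sample);

// Assumption: exp is a Unix timestamp in seconds.
const expiresAt = new Date(Number(fields.exp) * 1000);

console.log(fields.exp, fields.acl);
console.log(expiresAt.toISOString());
```

If that reading of exp is right, a generated value is only usable for a limited time, which would explain why it has to be regenerated for each run.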

repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt

Then you just need to convert this Node.js script to R and make your HTTP call with the URL parameters.

Note that encodedId comes from the input ID id8730931, converted using this function (these values don't seem to change with the epoch time):

var $$_ = 4294967295;
function _$__($) {
    var $$__ = 42;
    for (var _ = 0; _ < $.length; _++) {
        $$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
    }
    return Math.abs($$__);
}
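With readable names, the same function can be run directly to check where encodedId comes from. The name hashId is mine; the logic is unchanged from the snippet above.

```javascript
const MASK = 4294967295; // in JS, "& MASK" coerces to a signed 32-bit integer

// Readable version of the obfuscated _$__ function.
function hashId(id) {
  let hash = 42;
  for (let i = 0; i < id.length; i++) {
    hash = (hash * 31 + id.charCodeAt(i)) & MASK;
  }
  return Math.abs(hash);
}

console.log(hashId("id8730931")); // prints 1798339286
```

This matches the encodedId constant used in both the R and the Node.js scripts.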

My guess is that the server checks that the hmac correctly refers to the initial id string id8730931, so it's safe to hardcode (since it's also hardcoded on the server).

Bertrand Martel answered Oct 27 '22 07:10