I previously posted a related Stack Overflow question about scraping a table on the leaderboard page of the PGA's website. To summarize that post, the leaderboard table is difficult to scrape, apparently because the page uses JavaScript to render the table.
When I inspect the page, I can see a script tag containing an object, global.leaderboardConfig, with useful info.
Is it possible to get this object as a list in R? I am able to grab all 76 script elements on the page using xml2::read_html('https://www.pgatour.com/leaderboard.html') %>% html_nodes('script'), but I'm not sure how to identify the specific script tag I need, or how to get an object out of it.
Edit: In the Network tab of devtools, there is also a request that provides the link for an API call that gets the data. Rather than fetching the object from the script tag, perhaps it is easier to capture the network requests and sift through those instead?
This site generates the values of the hmac and expire URL parameters with a JS function that uses a specific algorithm. The arguments of this algorithm depend on the epoch time, which is passed as a URL parameter to the JS file hosting that function. As a result, the hmac value is different each time, because it is computed from a file whose URL changes constantly.
The algorithm is a chain of bitwise AND and XOR operations, like this (pseudocode):
step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step
step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step
....
....
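As a minimal sketch, the chain above can be written as a small JavaScript function. The starting value 101 and the mask 4294967295 are the fixed values used later in this answer; deriveToken and keysFor are names made up for this illustration. keysFor is a purely hypothetical helper that runs the chain backwards to build keys spelling a known text, so the round trip can be checked without any network access.

```javascript
const MASK = 4294967295; // 0xFFFFFFFF, the fixed bit mask
const SEED = 101;        // char code of "e", the fixed starting value

// Apply the chain: square, subtract, mask, xor with the next key.
function deriveToken(xorKeys, fixedValue) {
  let value = SEED;
  let result = String.fromCharCode(value);
  for (const key of xorKeys) {
    const step = ((value * value - fixedValue) & MASK) ^ key;
    result += String.fromCharCode(step);
    value = step;
  }
  return result;
}

// Hypothetical helper (not part of the site's code): compute the keys
// that would make the chain emit `text` after the seed character.
function keysFor(text, fixedValue) {
  let value = SEED;
  return Array.from(text, (ch) => {
    const code = ch.charCodeAt(0);
    const key = ((value * value - fixedValue) & MASK) ^ code;
    value = code;
    return key;
  });
}

const demoKeys = keysFor("xp=1~acl=*", 1798339286);
console.log(deriveToken(demoKeys, 1798339286)); // prints exp=1~acl=*
```

Because XOR is its own inverse, each key only has to cancel the masked intermediate value and leave the wanted character code, which is why the real keys change whenever the served script changes.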
The xorKey numbers are generated dynamically by https://microservice.pgatour.com/js based on the epoch time. You just need to request this JS file with the current epoch time as a URL parameter and use a regex to extract all the xorKey values required by the algorithm above (they start with -1). You will also need to reproduce the algorithm above in R.
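The two preparation steps (building the script URL and pulling the keys out of the script text) could be sketched in JavaScript as follows. The function names are made up for this example, and sampleScript is a hand-written stand-in for the real response, so nothing here touches the network.

```javascript
// Build the URL of the dynamic script for a given epoch time in milliseconds.
function buildScriptUrl(epochMs) {
  const url = new URL("https://microservice.pgatour.com/js");
  url.searchParams.set("_", String(epochMs));
  return url.toString();
}

// Extract every negative number starting with -1 from the script text.
function extractXorKeys(scriptText) {
  const matches = scriptText.match(/-1\d+/g) || [];
  return matches.map((m) => parseInt(m, 10));
}

// Assumed sample: two chain steps shaped like the obfuscated code.
const sampleScript =
  "_$_ = (__ * __ - $$) & $$_ ^ -1798328965, __ = _$_, " +
  "_$_ = (__ * __ - $$) & $$_ ^ -1798324966,";

console.log(buildScriptUrl(1618868625306));
console.log(extractXorKeys(sampleScript));
```

In real use the epoch argument would be Date.now(), and the text would come from an HTTP GET on the built URL.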
The following script generates the url parameters and makes the API call:
library(httr)
library(stringr)
library(bitops)
# fixed values
init <- 4294967295
value <- 101
encodedId <- 1798339286
result <- rawToChar(as.raw(value))
# epoch time is dynamic
time <- as.numeric(as.POSIXct(Sys.time()))*1000
output <- content(GET("https://microservice.pgatour.com/js", query = list(
"_" = format(time, digits=13)
)), as = "text", encoding = "UTF-8")
steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl=TRUE)) # extract the xor keys
stepsNum <- as.numeric(unlist(steps)) # convert to numeric
for(t in stepsNum){
  # same chain as the pseudocode: square, subtract, mask, xor
  step <- bitXor(bitAnd(value * value - encodedId, init), t)
  result <- paste0(result, rawToChar(as.raw(step)))
  value <- step
}
print(result)
# extract the leaderboard config URL from the page source
output <- content(GET("https://www.pgatour.com/leaderboard.html"), as = "text", encoding = "UTF-8")
# capture the quoted value and turn the escaped "\/" back into "/"
configUrl <- gsub("\\\\/", "/", str_match(output, "leaderboardUrl:\\s*'([^']*)'")[2])
url <- paste0(configUrl, "?userTrackingId=", result)
data <- content(GET(url), as = "parsed", type = "application/json")
print(data)
kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract
I searched through the JavaScript code and reversed the obfuscated parts into something understandable. It is quite a long way to go, so let's take it step by step.
leaderboardUrl
You gave the first hint in your question: the location of the config, which contains a leaderboardUrl.
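For illustration, here is a minimal JavaScript sketch of pulling leaderboardUrl out of the page source with a regex. The sample HTML is an assumption about how the inline config looks (a single-quoted value with escaped slashes), the example.invalid URL is a placeholder, and extractLeaderboardUrl is a made-up name.

```javascript
// Return the leaderboardUrl value from the page source, or null if absent.
function extractLeaderboardUrl(html) {
  const match = html.match(/leaderboardUrl:\s*'([^']*)'/);
  if (!match) return null;
  // The value is assumed to be JS-escaped, so turn "\/" back into "/".
  return match[1].replace(/\\\//g, "/");
}

// Assumed sample of the inline config; String.raw keeps the backslashes.
const sampleHtml = String.raw`<script>global.leaderboardConfig = {
  leaderboardUrl: 'https:\/\/example.invalid\/data\/leaderboard.json',
  translationsUrl: ''
};</script>`;

console.log(extractLeaderboardUrl(sampleHtml));
```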
There is a JS file named stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js that has occurrences of leaderboardUrl in config.leaderboardUrl:
{
key: "getLeaderboardData",
value: function (t, r, n) {
var o = this,
e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()), <===================== HERE
a = [this.performFetch(e)].concat(
g(
"initial" === n && this.config.translationsUrl
? [y.default.load(this.config.translationsUrl)]
: []
)
),
..........
}
Let's look at the performFetch function, which seems to send the request:
{
key: "performFetch",
value: function (t) {
var r = this,
e =
1 < arguments.length && void 0 !== arguments[1]
? arguments[1]
: {};
return t
? ((0, a.isProtectedUrl)(t) &&
(t = this.getUrlWithAuth(t)), <===================== HERE
(0, o.default)(t, e)
.then(function (e) {
return r.checkFetchResponseStatus(e, t);
.................
We've spotted the getUrlWithAuth function:
{
key: "getUrlWithAuth",
value: function (e) {
var t = u.setTrackingUserId,
r = u.UserIdTracker,
n = r && r.getTrackingUserIdParam && r.getUserId; <===================== HERE
if (t && n) {
var o = r.getTrackingUserIdParam(), <===================== HERE
a = t(r.getUserId());
return u.setUrlParameter(e, o, a);
}
return e;
},
},
Now we have getUserId and getTrackingUserIdParam, which look like the functions involved in adding the authorization parameter to the URL. The problem is finding where these functions are defined.
I've spotted a file named main.c03ddfd249437fcce43410c35a21c6f8.js that contains occurrences of getUserId and getTrackingUserIdParam:
var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
return t[g -= 398]
},
(function(g, e) {
for (var t = A; ; )
try {
if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
break;
g.push(g.shift())
} catch (e) {
g.push(g.shift())
}
}
)(t)
.................
function(g, e) {
var t = A
, C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
, I = t(428) + t(423) + t(407)
, o = t(483) + "rTr" + t(446) + t(477) + "Id";
C[t(489) + t(463) + t(469) + "cker"] = {
........................
getTrackingUserIdParam: function() {
return o
},
getUserId: function() {
return I
},
......................
}
}(jQuery, window)
},
I've skipped a lot of code in the above snippet to make it clearer. You can see that there are substitutions here: strings are looked up in the t array through the A function, which applies an offset to the index, and an init function first rotates the initial t array so that the lookups decode to the right strings.
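To make the mechanism easier to see, here is a toy version of the same trick. The array contents, the checksum and the names are all made up for this sketch; only the offset of 398 and the rotate-until-the-checksum-matches idea come from the snippet above.

```javascript
const OFFSET = 398;
// Toy scrambled array (made up); the real one holds over a hundred fragments.
const parts = ["ack", "3y", "7x", "set", "Tr"];

// Same idea as the A function: subtract the offset, then index the array.
const lookup = (index) => parts[index - OFFSET];

// Toy checksum: true only when the numeric-prefixed entries sit in the last
// two slots. parseInt("3y") is 3, parseInt("set") is NaN.
function isAligned() {
  return parseInt(lookup(401), 10) * 10 + parseInt(lookup(402), 10) === 37;
}

// Rotate (move the first element to the end) until the checksum passes.
let rotations = 0;
while (!isAligned() && rotations < parts.length) {
  parts.push(parts.shift());
  rotations++;
}

const decoded = lookup(398) + lookup(399) + lookup(400);
console.log(rotations, decoded); // prints 3 setTrack
```

The numeric prefixes in entries such as "13687NsSiEo" in the real array play the same role as "3y" and "7x" here: they exist only so that parseInt can feed the checksum.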
You can paste this snippet into a nodejs script, modify it a little and then you can use something like:
var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
return t[g -= 398]
};
console.log(t);
(function(g, e) {
for (var t = A; ; )
try {
if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
break;
g.push(g.shift())
} catch (e) {
g.push(g.shift())
}
}
)(t)
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);
// prints e[pgatour] = e[pgatour] || {};
Here e is window, so you "just" have to substitute all the A(XXX) calls to better understand what is going on.
You would spot this:
onBeforeSendRequest: function(g, e) {
var A = t;
if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
var I = this["getUse" + A(463)]()
, o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
, n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
}
},
which when decoded gives something like:
onBeforeSendRequest: function(g, e) {
if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
var I = this["getUserId"]()
, o = window["pgatour"]["setTrackingUserId"](I)
, n = this["getTrackingUserIdParam"]();
e.url = C["setUrlParameter"](e["url"], n, o)
}
},
The function we are looking for is window["pgatour"]["setTrackingUserId"]. But we could have known this from the first step. Remember, in the first JS file:
var t = u.setTrackingUserId
with u being window.pgatour. Here, however, we also learn that I, the input parameter, is hardcoded:
var I = A(428) + A(423) + A(407);
which is equivalent to var I = "id8730931"
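Putting the decoded pieces together, the URL signing could be sketched like this. The parameter name userTrackingId is taken from the URL built in the R script above; the body of setUrlParameter is an assumption (the site's real helper may encode values or replace existing parameters), and fakeTokenFn is a stand-in for window.pgatour.setTrackingUserId.

```javascript
const TRACKING_PARAM = "userTrackingId"; // name taken from the R script's URL
const USER_ID = "id8730931";             // the hardcoded input id

// Assumed behavior: append name=value using "?" or "&" as needed.
function setUrlParameter(url, name, value) {
  const separator = url.includes("?") ? "&" : "?";
  return url + separator + name + "=" + value;
}

// Mirrors getUrlWithAuth: sign the URL only when a token function exists.
function getUrlWithAuth(url, setTrackingUserId) {
  if (typeof setTrackingUserId !== "function") return url;
  return setUrlParameter(url, TRACKING_PARAM, setTrackingUserId(USER_ID));
}

// Stand-in token function, used only so the sketch runs offline.
const fakeTokenFn = (id) => "token-for-" + id;
const signedUrl = getUrlWithAuth("https://example.invalid/leaderboard.json", fakeTokenFn);
console.log(signedUrl);
```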
Now let's look at the window["pgatour"]["setTrackingUserId"] function. Open the Chrome developer console on the website and paste window["pgatour"]["setTrackingUserId"]; you will get something like this:
function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................
Yes :( more obfuscated code to deal with.
By looking through the application scripts, you can find that the function is defined in the following JS file:
https://microservice.pgatour.com/js?_=1618868625306
There is a URL parameter specifying an epoch time, and the code changes depending on this parameter.
Looking at the code itself, we get something like this after substituting the input parameters, which are String.fromCharCode and Math.abs:
((function($__$, _, $_$) {
var $$_ = 4294967295; <===================== doesn't change when the epoch time is updated
function _$__($) {
var $$__ = 42;
for (var _ = 0; _ < $.length; _++) {
$$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
}
return Math.abs($$__);
}
......
_$_ = (__ * __ - $$) & $$_ ^ -30086, <===================== doesn't change when the epoch time is updated
___ += _(_$_),
__ = _$_,
_$_ = (__ * __ - $$) & $$_ ^ -33221,
___ += _(_$_),
.....
$__$[__$_] = (function(_$_$ = "id8730931") { <===================== this is window["pgatour"]["setTrackingUserId"] function / input is id8730931
var $$ = _$__(_$_$);
var _$_, ___, __;
var __ = (__ = 101,
___ = String.fromCharCode(__),
_$_ = (__ * __ - $$) & $$_ ^ -1798328965, <===================== this change when epoch time is updated
___ += String.fromCharCode(_$_),
__ = _$_,
_$_ = (__ * __ - $$) & $$_ ^ -1798324966,
___ += String.fromCharCode(_$_),
__ = _$_,
....
__ = _$_,
___);
return __
}
);
}
)((window.pgatour || (window.pgatour = {})), String.fromCharCode, Math.abs));
We can make a nodejs script to reproduce this algorithm in a simpler way by extracting the step value (in the xor stage):
const axios = require("axios");
const init = 4294967295;
var value = 101;
var encodedId = 1798339286;
var result = String.fromCharCode(value);
(async function () {
const response = await axios.get(
"https://microservice.pgatour.com/js?_=1618868625506"
);
const data = response.data.match(/-17\d+/g).map((it) => parseInt(it, 10));
for (const t of data) {
var step = ((value * value - encodedId) & init) ^ t;
result += String.fromCharCode(step);
value = step;
}
console.log(result);
})();
output:
exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd
If you change the epoch time, it will give a different result
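As a small aid for inspecting the result, the token can be split into its fields. This sketch assumes the fields are separated by "~" and each is a key=value pair, which matches the output above; parseToken is a made-up name, and reading exp as an expiry time in epoch seconds is an assumption.

```javascript
// Split a token of the form key=value~key=value into an object.
function parseToken(token) {
  const fields = {};
  for (const pair of token.split("~")) {
    const eq = pair.indexOf("=");
    fields[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return fields;
}

const sampleToken =
  "exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd";
const parsed = parseToken(sampleToken);

console.log(parsed.exp, parsed.acl, parsed.hmac.length);
// Assuming exp is epoch seconds, this shows it as a date.
console.log(new Date(Number(parsed.exp) * 1000).toISOString());
```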
repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt
Then you just need to convert this nodejs script to R and make your HTTP call with the URL parameters.
Note that encodedId comes from the input id id8730931, converted using this function (those values don't seem to change with the epoch time):
var $$_ = 4294967295;
function _$__($) {
var $$__ = 42;
for (var _ = 0; _ < $.length; _++) {
$$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
}
return Math.abs($$__);
}
My guess is that the server checks that the hmac correctly refers to the initial id string id8730931, so it's safe to hardcode it (since it's also hardcoded on the server).
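For reference, here is a readable JavaScript version of that hash with descriptive names (hashId and MASK are names chosen for this sketch). Running it on the hardcoded id reproduces the encodedId constant used in the scripts above.

```javascript
const MASK = 4294967295; // 0xFFFFFFFF; in JS the & coerces to a signed 32-bit int

// Rolling hash: start at 42, multiply by 31 and add each char code.
function hashId(id) {
  let hash = 42;
  for (let i = 0; i < id.length; i++) {
    hash = (hash * 31 + id.charCodeAt(i)) & MASK;
  }
  return Math.abs(hash);
}

console.log(hashId("id8730931")); // prints 1798339286
```

Note that a port to another language has to reproduce the signed 32-bit wraparound of the JavaScript & operator to get the same number.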