In a reddit URL there is a 5-character alphanumeric thing_id part (for example, "wplf7" in "http://redd.it/wplf7"), which is encoded in base 36.
"wplf7" corresponds to the number 54941875; that is what I have found so far. I'm wondering how 54941875 is generated.
I'm trying to scrape the comments of a specific section of reddit (let's say http://www.reddit.com/r/leagueoflegends/) using R, and I'm stuck at this 5-character alphanumeric ID.
Can anyone explain this in a simple manner? Unfortunately Python is not my domain, and the 2000 lines of Python code listed on Reddit's website didn't help me much.
Thanks,
First, set a unique-ish user agent, as reddit likes this:
options(HTTPUserAgent="My name is BOB")
I assume you want to get the content at http://www.reddit.com/r/leagueoflegends/. You need to append .json to the URL:
library(RJSONIO)
library(RCurl)
# library(XML)
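# pass the user agent explicitly; RCurl takes it as the `useragent` curl option
jdata<-getURL('http://www.reddit.com/r/leagueoflegends/.json', useragent=getOption("HTTPUserAgent"))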
jdata<-fromJSON(jdata)
# xdata<-getURL('http://www.reddit.com/r/leagueoflegends/.xml')
# xdata<-xmlParse(xdata)
The content is very rich. For example, you can pull out the domains, permalinks, authors, titles, ids and creation times of the posts:
Domains<-sapply(jdata[[2]]$children,function(x){x$data$domain})
permalinks<-sapply(jdata[[2]]$children,function(x){x$data$permalink})
authors<-sapply(jdata[[2]]$children,function(x){x$data$author})
titles<-sapply(jdata[[2]]$children,function(x){x$data$title})
ids<-sapply(jdata[[2]]$children,function(x){x$data$id})
created<-as.POSIXct(sapply(jdata[[2]]$children,function(x){x$data$created}),origin="1970/01/01")
> head(titles)
[1] "Pendragon 3-day-banning someone for randoming in ranked, or saying hes going to. Mixed feelings..."
[2] "Dig Kicks L0cust."
[3] "Summoners, y u no communicate??"
[4] "Without Even Trying"
[5] "Cross Country Tryndamere (Chaox Stream)"
[6] "Top 5 Flops - Episode 4 ft Dyrus, Phantoml0rd, and HatPerson vs Baron Nashor"
>
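The repeated sapply calls above could be folded into one small helper. This is only a sketch: getField is a hypothetical name, and jdata_mock is a hand-made stand-in with placeholder values so the snippet runs without a network connection. It assumes the parsed listing keeps its posts under $data$children, which is what jdata[[2]]$children refers to above.

```r
# Hand-made stand-in for the parsed listing (all values are placeholders)
jdata_mock <- list(
  kind = "Listing",
  data = list(children = list(
    list(data = list(id = "aaaa1", title = "First post",  author = "alice")),
    list(data = list(id = "aaaa2", title = "Second post", author = "bob"))
  ))
)

# Hypothetical helper: pull one named field out of every post
getField <- function(listing, field) {
  vapply(listing$data$children,
         function(x) as.character(x$data[[field]]),
         character(1))
}

# Collect several fields into one data frame, one row per post
fields <- c("id", "title", "author")
posts <- as.data.frame(
  sapply(fields, function(f) getField(jdata_mock, f)),
  stringsAsFactors = FALSE
)
posts
```

With the real data you would pass jdata instead of jdata_mock and list whichever fields you need.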
To investigate how these ids are generated, we can apply the base36ToInteger function from @Ben Bolker's answer
to the ids we have gathered and compare the results against the times the posts were created, giving:
createData<-data.frame(created=created,ids=sapply(ids,base36ToInteger))
> dput(createData)
structure(list(created = structure(c(1342658844, 1342657298,
1342622962, 1342643655, 1342641187, 1342654768, 1342665353, 1342640599,
1342648272, 1342662822, 1342654185, 1342659591, 1342624350, 1342647907,
1342637587, 1342591960, 1342625515, 1342642330, 1342651384, 1342668363,
1342608976, 1342608165, 1342632545, 1342638611, 1342643489), class = c("POSIXct",
"POSIXt")), ids = c(55047001, 55044612, 55010018, 55025557, 55022809,
55040754, 55056689, 55022221, 55031424, 55053023, 55039810, 55048123,
55010880, 55030934, 55019343, 54976515, 55011555, 55024060, 55035670,
55061120, 54998192, 54997264, 55015528, 55020295, 55025363)), .Names = c("created",
"ids"), row.names = c("wrujd", "wrsp0", "wr202", "wrdzp", "wrbvd",
"wrppu", "ws20h", "wrbf1", "wriio", "wrz6n", "wrozm", "wrvej",
"wr2o0", "wri52", "wr973", "wqc5f", "wr36r", "wrcu4", "wrlsm",
"ws5fk", "wqsvk", "wqs5s", "wr694", "wr9xj", "wrdub"), class = "data.frame")
which implies that reddit generates these numbers sequentially across the site as new posts are created.
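One quick way to check this, sketched with a handful of (created, ids) pairs copied from the dput output above: sort the ids by creation time and confirm they only ever increase. The variable names are just for illustration.

```r
# A few (created, ids) pairs copied from the dput() output above
created <- c(1342658844, 1342657298, 1342622962,
             1342643655, 1342641187, 1342591960)
ids     <- c(55047001, 55044612, 55010018,
             55025557, 55022809, 54976515)

# Put the ids in order of creation time
ids_by_time <- ids[order(created)]

# If ids are handed out sequentially, every step must be positive
is_sequential <- all(diff(ids_by_time) > 0)
is_sequential
## [1] TRUE
```

The gaps between consecutive ids here are in the thousands, which fits the idea that the counter is shared across the whole site rather than kept per subreddit.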
Without a specific direction I will leave it at this but hopefully you get the idea.
I started from code for generic base conversion posted by Erich Neuwirth on r-help in 2008: this is recursive, so may be slow -- but it took the right amount of time for me to develop it!
numberInBase <- function(number,base){
  # recursive worker: digits of the quotient first, then the last digit
  numberInBaseRecur<-function(number,base){
    lastDigit<-function(number,base) number %% base
    if (number == 0) result <- c(0)
    else result <- c(numberInBaseRecur(number %/% base,base),
                     lastDigit(number,base))
    result
  }
  result <- numberInBaseRecur(number,base)
  # strip leading zeros, but keep a lone 0
  while (result[1]== 0 && length(result)>1)
    result <- result[-1]
  result
}
A quick test:
numberInBase(36^3,36)
## [1] 1 0 0 0
Now all we need is to convert from decimal to base 36, then index the appropriate alphanumeric string. Here's your example:
b36string <- c(0:9,letters)
paste(b36string[numberInBase(54941875,36)+1],collapse="")
## [1] "wplf7"
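If the recursion ever does turn out to be slow, the same conversion can be written as a loop. This is a minimal sketch, not a drop-in replacement: intToBase36 is a hypothetical name, and it assumes a non-negative whole number small enough to be represented exactly as a double.

```r
# Hypothetical iterative alternative to numberInBase() plus the paste() step
intToBase36 <- function(n) {
  digits <- c(0:9, letters)          # "0".."9", "a".."z"
  if (n == 0) return("0")
  out <- character(0)
  while (n > 0) {
    # peel off the last base-36 digit and prepend its character
    out <- c(digits[n %% 36 + 1], out)
    n <- n %/% 36
  }
  paste(out, collapse = "")
}

intToBase36(54941875)
## [1] "wplf7"
```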
If you need to go the other way, there is a post by Jim Holtman from Jan 2012 that gives a solution:
base36ToInteger <- function (Str) {
  common <- chartr("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
                   , ":;<=>?@ABCDEFGHIJKLMNOPQRS:;<=>?@ABCDEFGHIJKLMNOPQRS"
                   , Str)
  x <- as.numeric(charToRaw(common)) - 48
  sum(x * 36 ^ rev(seq(length(x)) - 1))
}
base36ToInteger("wplf7")
## [1] 54941875
(I haven't stopped to figure out how this actually works, but you can read the post ...)
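For what it's worth, the trick appears to rest on ASCII ordering: chartr swaps each letter for the character that sits 10 to 35 places after "0", so subtracting 48 (the code for "0") yields the digit value directly. Here is a step-by-step sketch of my reading of it on the same example, with a cross-check against base R's strtoi, which accepts bases up to 36:

```r
# 1. Swap each letter for the ASCII character 10..35 places after "0"
#    (":" has code 58, so "a" becomes 58 - 48 = 10, ..., "z" becomes 35)
common <- chartr("abcdefghijklmnopqrstuvwxyz",
                 ":;<=>?@ABCDEFGHIJKLMNOPQRS",
                 "wplf7")

# 2. Subtract the code for "0" to turn every character into its digit value
x <- as.numeric(charToRaw(common)) - 48
x
## [1] 32 25 21 15  7

# 3. Weight each digit by its power of 36 and add everything up
total <- sum(x * 36^rev(seq_along(x) - 1))
total
## [1] 54941875

# Cross-check with base R
strtoi("wplf7", base = 36L)
## [1] 54941875
```

Note that strtoi returns an R integer, so it gives NA for ids too large for 32 bits; the floating-point version above does not have that limit.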