
Origin of the 5-character (alphanumeric) ID in a reddit URL

Tags: r, reddit

A reddit URL contains a 5-character alphanumeric thing_id (for example, "wplf7" in "http://redd.it/wplf7"), which is the base-36 encoding of a number.

wplf7 corresponds to the number 54941875; that is as far as I have got. I'm wondering how 54941875 itself is generated.

I'm trying to scrape the comments of a specific section of reddit (say http://www.reddit.com/r/leagueoflegends/) using R, and I'm stuck on this 5-character ID.

Can anyone explain this in simple terms? Unfortunately Python is not my domain, and the 2000 lines of Python code listed on reddit's website didn't help me much.

Thanks,

user1486507 asked Dec 15 '22 21:12


2 Answers

First, set a unique-ish user agent, as reddit prefers this:

options(HTTPUserAgent="My name is BOB")

I assume you want to get the content at http://www.reddit.com/r/leagueoflegends/. You need to append .json to the URL:

library(RJSONIO)
library(RCurl)
# library(XML)

jdata<-getURL('http://www.reddit.com/r/leagueoflegends/.json')
jdata<-fromJSON(jdata)
# xdata<-getURL('http://www.reddit.com/r/leagueoflegends/.xml')
# xdata<-xmlParse(xdata)

The content is very rich. For example, you can extract the domains, permalinks, authors, titles, ids and creation times of the posts:

Domains<-sapply(jdata[[2]]$children,function(x){x$data$domain})
permalinks<-sapply(jdata[[2]]$children,function(x){x$data$permalink})
authors<-sapply(jdata[[2]]$children,function(x){x$data$author})
titles<-sapply(jdata[[2]]$children,function(x){x$data$title})
ids<-sapply(jdata[[2]]$children,function(x){x$data$id})
created<-as.POSIXct(sapply(jdata[[2]]$children,function(x){x$data$created}),origin="1970/01/01")


> head(titles)
[1] "Pendragon 3-day-banning someone for randoming in ranked, or saying hes going to. Mixed feelings..."
[2] "Dig Kicks L0cust."                                                                                 
[3] "Summoners, y u no communicate??"                                                                   
[4] "Without Even Trying"                                                                               
[5] "Cross Country Tryndamere (Chaox Stream)"                                                           
[6] "Top 5 Flops - Episode 4 ft Dyrus, Phantoml0rd, and HatPerson vs Baron Nashor"                      
> 
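The extraction above can also be collected into a single data frame. The following is only a minimal sketch: `mock` is a hypothetical, hand-built stand-in for the parsed listing (so that it runs without a network call), and it assumes the same nesting that `jdata[[2]]$children` relies on. It uses `vapply` rather than `sapply` so that a missing field raises an error instead of silently producing a list.

```r
# Hypothetical stand-in for the result of fromJSON(); only three fields are mocked.
mock <- list(
  kind = "Listing",
  data = list(children = list(
    list(kind = "t3", data = list(id = "wrsp0", title = "Dig Kicks L0cust.",
                                  created = 1342657298)),
    list(kind = "t3", data = list(id = "wrdzp", title = "Without Even Trying",
                                  created = 1342643655))
  ))
)

children <- mock[[2]]$children   # same access path as jdata[[2]]$children

# One row per post; vapply() enforces the type and length of every field.
posts <- data.frame(
  id      = vapply(children, function(x) x$data$id, character(1)),
  title   = vapply(children, function(x) x$data$title, character(1)),
  created = as.POSIXct(vapply(children, function(x) x$data$created, numeric(1)),
                       origin = "1970-01-01", tz = "UTC"),
  stringsAsFactors = FALSE
)

print(posts)
```

With the real data, replacing `mock` by `jdata` should give one row per post on the listing page, provided every post carries these fields.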

To investigate how these ids are generated, we can apply @Ben Bolker's base36ToInteger function to the ids we have gathered and compare the results against the times the posts were created, giving:

createData<-data.frame(created=created,ids=sapply(ids,base36ToInteger))
> dput(createData)
structure(list(created = structure(c(1342658844, 1342657298, 
1342622962, 1342643655, 1342641187, 1342654768, 1342665353, 1342640599, 
1342648272, 1342662822, 1342654185, 1342659591, 1342624350, 1342647907, 
1342637587, 1342591960, 1342625515, 1342642330, 1342651384, 1342668363, 
1342608976, 1342608165, 1342632545, 1342638611, 1342643489), class = c("POSIXct", 
"POSIXt")), ids = c(55047001, 55044612, 55010018, 55025557, 55022809, 
55040754, 55056689, 55022221, 55031424, 55053023, 55039810, 55048123, 
55010880, 55030934, 55019343, 54976515, 55011555, 55024060, 55035670, 
55061120, 54998192, 54997264, 55015528, 55020295, 55025363)), .Names = c("created", 
"ids"), row.names = c("wrujd", "wrsp0", "wr202", "wrdzp", "wrbvd", 
"wrppu", "ws20h", "wrbf1", "wriio", "wrz6n", "wrozm", "wrvej", 
"wr2o0", "wri52", "wr973", "wqc5f", "wr36r", "wrcu4", "wrlsm", 
"ws5fk", "wqsvk", "wqs5s", "wr694", "wr9xj", "wrdub"), class = "data.frame")

The integer ids increase steadily with creation time, which implies that reddit generates these numbers sequentially across the site as new posts are created.
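The sequential pattern can be checked numerically as well as by eye. This is a rough sketch using five of the (created, id) pairs from the `dput` output above, deliberately left unsorted; the variable names are mine, and five points are of course only an illustration, not proof of how reddit assigns ids.

```r
# Five (created, id) pairs copied from the dput() output, in their original order.
created <- c(1342622962, 1342591960, 1342668363, 1342608976, 1342608165)
ids     <- c(55010018,   54976515,   55061120,   54998192,   54997264)

# If ids are handed out sequentially, sorting by creation time must sort the ids too.
by_time    <- order(created)
sequential <- !is.unsorted(ids[by_time], strictly = TRUE)

# Average number of new ids per second over the sampled window.
rate <- diff(range(ids)) / diff(range(created))

print(sequential)
print(rate)
```

Running the same check on all 25 rows of `createData` is a matter of replacing the two vectors with `as.numeric(createData$created)` and `createData$ids`.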

Without a more specific direction I will leave it at this, but hopefully you get the idea.

shhhhimhuntingrabbits answered Jan 01 '23 14:01


I started from code for generic base conversion posted by Erich Neuwirth on r-help in 2008. It is recursive, so it may be slow, but it took the right amount of time for me to develop it!

numberInBase <- function(number,base){
    numberInBaseRecur<-function(number,base){
        lastDigit<-function(number,base) number %% base
        if (number == 0) result <- c(0)
        else result <- c(numberInBaseRecur(number %/% base,base),
                         lastDigit(number,base))
        result
    }
    result <- numberInBaseRecur(number,base)
    while (result[1]== 0 && length(result)>1)
        result <- result[-1]
    result
} 

A quick test:

numberInBase(36^3,36)
## [1] 1 0 0 0

Now all we need is to convert from decimal to base 36, then index the appropriate alphanumeric string. Here's your example:

b36string <- c(0:9,letters)
paste(b36string[numberInBase(54941875,36)+1],collapse="")
## [1] "wplf7"
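If the recursion is a concern, the same conversion can be written as a loop. This is a minimal sketch rather than part of the r-help code; `intToBase36` is a name I made up, and it assumes a non-negative whole number small enough to be represented exactly as a double.

```r
# Iterative integer -> base-36 string conversion (hypothetical helper).
intToBase36 <- function(n) {
  digits <- c(0:9, letters)          # "0".."9", "a".."z"
  if (n == 0) return("0")
  out <- character(0)
  while (n > 0) {
    out <- c(digits[n %% 36 + 1], out)  # prepend the least significant digit
    n <- n %/% 36
  }
  paste(out, collapse = "")
}

intToBase36(54941875)
## [1] "wplf7"
```

Each pass peels off the last base-36 digit with `%%` and shifts the number right with `%/%`, which is what the recursive version does on the way back up the call stack.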

If you need to go the other way, there is a post by Jim Holtman from Jan 2012 that gives a solution:

base36ToInteger <- function (Str) {
    common <- chartr("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
                     , ":;<=>?@ABCDEFGHIJKLMNOPQRS:;<=>?@ABCDEFGHIJKLMNOPQRS"
                     , Str)
    x <- as.numeric(charToRaw(common)) - 48
    sum(x * 36 ^ rev(seq(length(x)) - 1))
} 

base36ToInteger("wplf7")
## [1] 54941875

(I haven't stopped to figure out how this actually works, but you can read the post ...)
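For what it is worth, here is my reading of how the function works, as a step-by-step sketch on the example id (the intermediate variable names are mine). The `chartr` call maps the letters onto the ASCII characters that immediately follow "9" (":" has code 58), so subtracting 48, the code of "0", turns every character into its digit value from 0 to 35. The last line then applies the usual positional weights. Base R's `strtoi` with `base = 36L` should give the same answer for ids that fit in a 32-bit integer.

```r
s <- "wplf7"

# Step 1: remap letters so that a..z land on the ASCII codes 58..83, right after "9" (57).
common <- chartr("abcdefghijklmnopqrstuvwxyz",
                 ":;<=>?@ABCDEFGHIJKLMNOPQRS", s)

# Step 2: ASCII code minus 48 (the code of "0") is the base-36 digit value.
digit_values <- as.numeric(charToRaw(common)) - 48
digit_values
## [1] 32 25 21 15  7

# Step 3: weight each digit by 36^position, most significant digit first.
weights <- 36 ^ rev(seq_along(digit_values) - 1)
sum(digit_values * weights)
## [1] 54941875

# Cross-check with base R.
strtoi(s, base = 36L)
## [1] 54941875
```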

Ben Bolker answered Jan 01 '23 14:01