Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rank most recent scores of students within a given date - 30 days window

Following is what my dataframe/data.table looks like. The rank column is my desired calculated field.

library(data.table)
df <- fread('
             Name   Score         Date              Rank
             John    42         1/1/2018              3   
             Rob     85         12/31/2017            2
             Rob     89         12/26/2017            1
             Rob     57         12/24/2017            1
             Rob     53         08/31/2017            1
             Rob     72         05/31/2017            2
             Kate    87         12/25/2017            1
             Kate    73         05/15/2017            1
             ')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]

I am trying to calculate the rank of each student at every given point in time in the data within a 30 day windows. For that, I need to fetch the most recent scores of all students at a given point in time and then pass the rank function.

In the 1st row, as of 1/1/2018, John has two more competitors in a past 30 day window: Rob with the most recent score of 85 in 12/31/2017 AND Kate with the most recent score of 87 in 12/25/2017 and both of these dates fall within the 1/1/2018 - 30 Day Window. John gets a rank of 3 with the lowest score of 42. If only one students falls within date(at a given row) - 30 day window, then the rank is 1.

In the 3rd row the date is 12/26/2017. So Rob's score as of 12/26/2017 is 89. There is only one case of another student that falls in the time window of 12/26/2017 - 30 days and that is the most recent score(87) of kate on 12/25/2017. So within the time window of (12/26/2017) - 30 , Rob's score of 89 is higher than Kate's score of 87 and therefore Rob gets rank 1.

I was thinking about using the framework from here Efficient way to perform running total in the last 365 day window but struggling to think of a way to fetch all recent score of all students at a given point in time before using rank.

like image 906
gibbz00 Avatar asked Apr 02 '18 22:04

gibbz00


4 Answers

This seems to work:

ranks = df[.(d_dn = Date - 30L, d_up = Date), on=.(Date >= d_dn, Date <= d_up), allow.cart=TRUE][, 
  .(LatestScore = last(Score)), by=.(Date = Date.1, Name)]

setorder(ranks, Date, -LatestScore)
ranks[, r := rowid(Date)]

df[ranks, on=.(Name, Date), r := i.r]

   Name Score       Date Rank r
1: John    42 2018-01-01    3 3
2:  Rob    85 2017-12-31    2 2
3:  Rob    89 2017-12-26    1 1
4:  Rob    57 2017-12-24    1 1
5:  Rob    53 2017-08-31    1 1
6:  Rob    72 2017-05-31    2 2
7: Kate    87 2017-12-25    1 1
8: Kate    73 2017-05-15    1 1

... using last since the Cartesian join seems to sort and we want the latest measurement.

How the update join works

The i. prefix means it's a column from i in the x[i, ...] join, and the assignment := is always in x. So it's looking up each row of i in x and where matches are found, copying values from i to x.

Another way that is sometimes useful is to look up x rows in i, something like df[, r := ranks[df, on=.(Name,Date), x.r]] in which case x.r is still from the ranks table (now in the x position relative to the join).


There's also...

ranks = df[CJ(Name = Name, Date = Date, unique=TRUE), on=.(Name, Date), roll=30, nomatch=0]
setnames(ranks, "Score", "LatestScore")

# and then use the same last three lines above    

I'm not sure about efficiency of one vs another, but I guess it depends on number of Names, frequency of measurement and how often measurement days coincide.

like image 140
Frank Avatar answered Nov 05 '22 18:11

Frank


A solution that uses data.table though not sure if it is the most efficient usage:

df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date), 
    .(Rank=frank(-c(iScore[1L], .SD[Name != iName, max(Score), by=.(Name)]$V1), 
        ties.method="first")[1L]), 
    by=.EACHI, 
    on=.(Date >= StartDate, Date <= EndDate)]

Explanation:

1) The outer square brackets do a non-equi join within a date range (i.e. 30days ago and latest date for each row). Try studying the below output against the input data:

df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
    c(.(RowGroup=.GRP), 
        .SD[, .(Name, Score, Date, OrigDate, iName, iScore, iDate, StartDate, EndDate)]),
    by=.EACHI,
    on=.(Date >= StartDate, Date <= EndDate)]

2) .EACHI is to perform j calculations for each row of i.

3) Inside j, iScore[1L] is the score for the current row, .SD[Name != iName] means taking scores not corresponding to the student in the current row. Then, we use the max(Score) for each student of those students within the 30days window.

4) Concatenate all these scores and calculate the rank for the score of the current row while taking care of ties by taking the first tie.

Note:

see ?data.table to understand what i, j, by, on and .EACHI refers to.


EDIT after comments by OP:

I would add a OrigDate column and find those that matches the latest date.

df[, OrigDate := Date]

df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date), 
    .(Name=iName, Score=iScore, Date=iDate, 
        Rank=frank(-c(iScore[1L], 
                .SD[Name != iName, Score[OrigDate==max(OrigDate)], by=.(Name)]$V1), 
            ties.method="first")[1L]), 
    by=.EACHI, 
    on=.(Date >= StartDate, Date <= EndDate)]
like image 22
chinsoon12 Avatar answered Nov 05 '22 18:11

chinsoon12


I came up with following partial solution, encountered however problem - is it possible that there will be two people occuring with the same date?

if not, have a look at following piece of code:

library(tidyverse) # easy manipulation
library(lubridate) # time handling
# This function can be added to 
get_top <- function(df, date_sel) {
  temp <- df %>% 
    filter(Date > date_sel - months(1)) %>% # look one month in the past from given date
    group_by(Name) %>% # and for each occuring name
    summarise(max_score = max(Score)) %>% # find the maximal score
    arrange(desc(max_score)) %>% # sort them
    mutate(Rank = 1:n()) # and rank them
  temp
}

Now, you have to find the name in the table, for given date and return its rank.

like image 1
johnnyheineken Avatar answered Nov 05 '22 18:11

johnnyheineken


library(data.table)
library(magrittr)

setorder(df, -Date)

fun <- function(i){
    df[i:nrow(df), head(.SD, 1), by = Name] %$% 
        rank(-Score[Date > df$Date[i] - 30])[1]
}
df[, rank := sapply(1:.N, fun)]
like image 1
IceCreamToucan Avatar answered Nov 05 '22 18:11

IceCreamToucan