Following is what my dataframe/data.table looks like. The rank
column is my desired calculated field.
library(data.table)
df <- fread('
Name Score Date Rank
John 42 1/1/2018 3
Rob 85 12/31/2017 2
Rob 89 12/26/2017 1
Rob 57 12/24/2017 1
Rob 53 08/31/2017 1
Rob 72 05/31/2017 2
Kate 87 12/25/2017 1
Kate 73 05/15/2017 1
')
df[,Date:= as.Date(Date, format="%m/%d/%Y")]
I am trying to calculate the rank of each student at every given point in time in the data within a 30 day windows. For that, I need to fetch the most recent scores of all students at a given point in time and then pass the rank function.
In the 1st row, as of 1/1/2018
, John
has two more competitors in a past 30 day window: Rob with the most recent score of 85
in 12/31/2017
AND Kate with the most recent score of 87
in 12/25/2017
and both of these dates fall within the 1/1/2018 - 30
Day Window. John gets a rank of 3
with the lowest score of 42
. If only one students falls within date(at a given row) - 30 day window
, then the rank is 1.
In the 3rd row the date is 12/26/2017
. So Rob's score as of 12/26/2017
is 89
. There is only one case of another student that falls in the time window of 12/26/2017 - 30
days and that is the most recent score(87
) of kate on 12/25/2017
. So within the time window of (12/26/2017) - 30
, Rob's score of 89
is higher than Kate's score of 87
and therefore Rob gets rank 1
.
I was thinking about using the framework from here Efficient way to perform running total in the last 365 day window but struggling to think of a way to fetch all recent score of all students at a given point in time before using rank.
This seems to work:
ranks = df[.(d_dn = Date - 30L, d_up = Date), on=.(Date >= d_dn, Date <= d_up), allow.cart=TRUE][,
.(LatestScore = last(Score)), by=.(Date = Date.1, Name)]
setorder(ranks, Date, -LatestScore)
ranks[, r := rowid(Date)]
df[ranks, on=.(Name, Date), r := i.r]
Name Score Date Rank r
1: John 42 2018-01-01 3 3
2: Rob 85 2017-12-31 2 2
3: Rob 89 2017-12-26 1 1
4: Rob 57 2017-12-24 1 1
5: Rob 53 2017-08-31 1 1
6: Rob 72 2017-05-31 2 2
7: Kate 87 2017-12-25 1 1
8: Kate 73 2017-05-15 1 1
... using last
since the Cartesian join seems to sort and we want the latest measurement.
How the update join works
The i.
prefix means it's a column from i
in the x[i, ...]
join, and the assignment :=
is always in x
. So it's looking up each row of i
in x
and where matches are found, copying values from i
to x
.
Another way that is sometimes useful is to look up x
rows in i
, something like df[, r := ranks[df, on=.(Name,Date), x.r]]
in which case x.r
is still from the ranks
table (now in the x
position relative to the join).
There's also...
ranks = df[CJ(Name = Name, Date = Date, unique=TRUE), on=.(Name, Date), roll=30, nomatch=0]
setnames(ranks, "Score", "LatestScore")
# and then use the same last three lines above
I'm not sure about efficiency of one vs another, but I guess it depends on number of Names, frequency of measurement and how often measurement days coincide.
A solution that uses data.table
though not sure if it is the most efficient usage:
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Rank=frank(-c(iScore[1L], .SD[Name != iName, max(Score), by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
1) The outer square brackets do a non-equi join within a date range (i.e. 30days ago and latest date for each row). Try studying the below output against the input data:
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
c(.(RowGroup=.GRP),
.SD[, .(Name, Score, Date, OrigDate, iName, iScore, iDate, StartDate, EndDate)]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
2) .EACHI
is to perform j
calculations for each row of i
.
3) Inside j
, iScore[1L]
is the score for the current row, .SD[Name != iName]
means taking scores not corresponding to the student in the current row. Then, we use the max(Score)
for each student of those students within the 30days window.
4) Concatenate all these scores and calculate the rank for the score of the current row while taking care of ties by taking the first tie.
see ?data.table
to understand what i
, j
, by
, on
and .EACHI
refers to.
I would add a OrigDate column and find those that matches the latest date.
df[, OrigDate := Date]
df[.(iName=Name, iScore=Score, iDate=Date, StartDate=Date-30, EndDate=Date),
.(Name=iName, Score=iScore, Date=iDate,
Rank=frank(-c(iScore[1L],
.SD[Name != iName, Score[OrigDate==max(OrigDate)], by=.(Name)]$V1),
ties.method="first")[1L]),
by=.EACHI,
on=.(Date >= StartDate, Date <= EndDate)]
I came up with following partial solution, encountered however problem - is it possible that there will be two people occuring with the same date?
if not, have a look at following piece of code:
library(tidyverse) # easy manipulation
library(lubridate) # time handling
# This function can be added to
get_top <- function(df, date_sel) {
temp <- df %>%
filter(Date > date_sel - months(1)) %>% # look one month in the past from given date
group_by(Name) %>% # and for each occuring name
summarise(max_score = max(Score)) %>% # find the maximal score
arrange(desc(max_score)) %>% # sort them
mutate(Rank = 1:n()) # and rank them
temp
}
Now, you have to find the name in the table, for given date and return its rank.
library(data.table)
library(magrittr)
setorder(df, -Date)
fun <- function(i){
df[i:nrow(df), head(.SD, 1), by = Name] %$%
rank(-Score[Date > df$Date[i] - 30])[1]
}
df[, rank := sapply(1:.N, fun)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With