I have a data frame containing a group identifier and a score for an observation from that group for an activity. Both group letters and scores can repeat.
library(tidyverse)
library(data.table)
set.seed(1)
dat <- data.frame(rowid = 1:1000,
grp = sample(LETTERS[1:20], 1000, replace = T),
score = sample(1:5, 1000, replace = T))
For each group, I need to rank their distinct scores in descending order; for each unique group and score combination, I would like to retain a single row. Ideally, this row would be the row with the smallest rowid, but this is not absolutely necessary. If it is more efficient to randomly select a row from each group and score combination, that works just as well.
dat %>%
group_by(grp) %>%
mutate(score_rank = dense_rank(desc(score))) %>%
arrange(grp, score_rank, rowid) %>% ## this step not completely necessary (see above)
group_by(grp, score_rank) %>%
mutate(score_rank_row = row_number()) %>%
filter(score_rank_row == 1) %>%
arrange(grp, -score)
The code above accomplishes what I want to do, but in reality, my actual data frame has 20m+ rows, has 5,000 values for group, and each group can have hundreds of scores, many of which repeat. As a result, when I try to use the solution presented above for my actual use case, my computer keeps running and running without finishing the operation.
My understanding is that data.table is much more efficient at handling these types of tasks, but I cannot for the life of me figure out the syntax to use it. Can anyone help me out?
With data.table, we can use
library(data.table)
setDT(dat)[, score_rank := frank(-score, ties.method = 'dense'), grp]
setkey(dat, grp, score_rank, rowid)
dat[, score_rank_row := rowid(grp, score_rank)][
score_rank_row == 1][order(grp, -score)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With