
data.table get N most frequent values by group

Tags: r, data.table

Let's say I want to find, for each purchase category, the top 3 most frequently occurring zip codes. In this example, the categories are home, townhouse, and condo. I have transactional data like:

set.seed(1234)
d <- data.table(purch_id = 1:3e6,
                purch_cat = sample(x = c('home','townhouse','condo'), 
                                   size = 3e6, replace=TRUE),
                purch_zip = formatC( sample(x = 1e4:9e4, size = 3e6, replace=TRUE), 
                                     width = 5, format = "d", flag = "0") )

I know that I can do this:

# there has to be a better way...
d[,list(purch_count = length(purch_id)), 
  by=list(purch_cat, purch_zip)][, purch_rank := rank(-purch_count, ties.method='min'), 
                                 by=purch_cat][purch_rank<=3,][order(purch_cat, purch_rank)]

    purch_cat purch_zip purch_count purch_rank
 1:     condo     39169          32          1
 2:     condo     15725          31          2
 3:     condo     75768          30          3
 4:     condo     72023          30          3
 5:      home     71294          30          1
 6:      home     56053          30          1
 7:      home     57971          29          3
 8:      home     77521          29          3
 9:      home     70124          29          3
10:      home     25302          29          3
11:      home     65292          29          3
12:      home     39488          29          3
13: townhouse     39587          33          1
14: townhouse     80365          30          2
15: townhouse     37360          30          2

But this can't be the most elegant data.table approach, and it seems kind of slow.

Any suggestions to reduce the number of passes? Maybe something using table()? TYVM!

C8H10N4O2 asked Dec 14 '22 12:12


2 Answers

One approach would be

d[ , .N, by=.(purch_cat, purch_zip)][
  order(-N), 
  .SD[ N >= unique(N)[3] ]
,by=purch_cat]

which gives

    purch_cat purch_zip  N
 1: townhouse     39587 33
 2: townhouse     80365 30
 3: townhouse     37360 30
 4: townhouse     83099 28
 5: townhouse     33518 28
 6: townhouse     59347 28
 7: townhouse     22402 28
 8:     condo     39169 32
 9:     condo     15725 31
10:     condo     75768 30
11:     condo     72023 30
12:      home     71294 30
13:      home     56053 30
14:      home     57971 29
15:      home     77521 29
16:      home     70124 29
17:      home     25302 29
18:      home     65292 29
19:      home     39488 29
20:      home     81754 28
21:      home     43426 28
22:      home     16943 28
23:      home     88978 28
24:      home     43003 28
25:      home     76501 28
    purch_cat purch_zip  N
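
To see what the `N >= unique(N)[3]` filter is doing, here is a minimal sketch on made-up counts (the `toy` table and its values are hypothetical, not taken from the data above). Because the rows are already sorted by `order(-N)`, `unique(N)` lists the distinct count levels from highest to lowest, so its third element is the third-highest distinct count.

```r
library(data.table)

# Hypothetical counts, already sorted descending as order(-N) would leave them
toy <- data.table(zip = c("a", "b", "c", "d", "e", "f"),
                  N   = c(33L, 30L, 30L, 28L, 28L, 27L))

# Distinct count levels, highest first: 33, 30, 28, 27
cutoff <- toy[, unique(N)[3]]   # third-highest distinct count

# Keep every zip whose count is at or above that level
kept <- toy[N >= cutoff]
print(kept)
```

This keeps five of the six rows, which is why the output above has more rows than the OP's. One thing to watch for: if a group has fewer than three distinct counts, `unique(N)[3]` is `NA`, and as far as I know data.table treats `NA` in a logical `i` as `FALSE`, so that group would drop out entirely.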

To achieve the OP's tie-breaking rule, one could do

d[ , .N, by=.(purch_cat,purch_zip)][
  order(-N),
  .SD[ N >= unique(N)[3] ][
      .N - frank(N, ties.method='max') < 3 ]
, by=purch_cat]

which gives

    purch_cat purch_zip  N
 1: townhouse     39587 33
 2: townhouse     80365 30
 3: townhouse     37360 30
 4:     condo     39169 32
 5:     condo     15725 31
 6:     condo     75768 30
 7:     condo     72023 30
 8:      home     71294 30
 9:      home     56053 30
10:      home     57971 29
11:      home     77521 29
12:      home     70124 29
13:      home     25302 29
14:      home     65292 29
15:      home     39488 29

Following @MichaelChirico's answer, this approach adds a frank step.
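
The `.N - frank(N, ties.method='max') < 3` condition is a bit cryptic, so here is a small sketch of the arithmetic on a made-up vector of counts (hypothetical values). With `ties.method = "max"`, the ascending rank of a value equals the number of rows less than or equal to it, so subtracting it from the row count gives the number of rows with a strictly larger count. Requiring that to be below 3 should be the same as requiring a "min" rank of at most 3 on the descending counts.

```r
library(data.table)

# Hypothetical counts, sorted descending
N <- c(30L, 30L, 29L, 29L, 29L, 28L)

# Ascending rank; tied values all take the largest rank in their block
asc_max <- frank(N, ties.method = "max")

# Number of rows with a strictly larger count
n_above <- length(N) - asc_max

# The OP's ranking: "min" rank on the negated (descending) counts
desc_min <- frank(-N, ties.method = "min")

# The two formulations agree: n_above + 1 is the "min" descending rank
print(data.frame(N, asc_max, n_above, desc_min))
keep <- n_above < 3
```

Here the single 28 has five larger values above it and is dropped, while both 30s and all three 29s are kept.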

Frank answered Jan 07 '23 03:01


Edit: improved.

I think you were exactly on the right track. The one key thing you're missing is frank, data.table's optimized ranking function, which should speed up your code considerably (it runs almost instantaneously on your 3-million-row sample data):

d[ , .(purch_count = .N), 
  by = .(purch_cat, purch_zip)
  ][, purch_rank := frank(-purch_count, ties.method = 'min'), 
    keyby = purch_cat
    ][purch_rank <= 3,
      ][order(purch_cat, purch_rank)]

    purch_cat purch_zip purch_count purch_rank
 1:     condo     39169          32          1
 2:     condo     15725          31          2
 3:     condo     75768          30          3
 4:     condo     72023          30          3
 5:      home     71294          30          1
 6:      home     56053          30          1
 7:      home     57971          29          3
 8:      home     77521          29          3
 9:      home     70124          29          3
10:      home     25302          29          3
11:      home     65292          29          3
12:      home     39488          29          3
13: townhouse     39587          33          1
14: townhouse     80365          30          2
15: townhouse     37360          30          2

Incomplete answer with table (slow):

Yes, one way involves using table.

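d[ , {
  x <- table(purch_zip)              # count occurrences of each zip in the group
  x <- x[order(-x)]                  # sort the counts in descending order
  names(x[x %in% unique(x)[1:3]])    # zips whose count is one of the top 3 distinct counts
}, keyby = purch_cat]

    purch_cat    V1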
 1:     condo 39169
 2:     condo 15725
 3:     condo 72023
 4:     condo 75768
 5:      home 56053
 6:      home 71294
 7:      home 25302
 8:      home 39488
 9:      home 57971
10:      home 65292
11:      home 70124
12:      home 77521
13:      home 16943
14:      home 43003
15:      home 43426
16:      home 76501
17:      home 81754
18:      home 88978
19: townhouse 39587
20: townhouse 37360
21: townhouse 80365
22: townhouse 22402
23: townhouse 33518
24: townhouse 59347
25: townhouse 83099
    purch_cat    V1
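
A rough sketch of the same idea in base R on a made-up vector (hypothetical values) may help show why this is labeled incomplete: it returns every zip at one of the three highest distinct counts, without the counts themselves, rather than applying the OP's "min" rank rule. I have used `sort()` and `as.vector()` here instead of the exact calls above, purely to keep the toy example explicit.

```r
# Hypothetical zips: e occurs 4 times, a 3, b 2, c 2, d 1
zips <- c("a", "a", "a", "b", "b", "c", "c", "d", "e", "e", "e", "e")

# Tabulate and sort the counts in descending order
cnt <- sort(table(zips), decreasing = TRUE)

# The three highest distinct count levels: 4, 3, 2
top_levels <- unique(as.vector(cnt))[1:3]

# Names of all zips whose count is one of those levels
top <- names(cnt)[as.vector(cnt) %in% top_levels]
print(top)
```

With the OP's rule, b and c would tie at rank 3 and still be kept here, but on larger groups the two rules diverge, as the 25-row output above shows.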

MichaelChirico answered Jan 07 '23 04:01