Let's say I want to find, for each purchase category, the top 3 most frequently occurring zip codes. In this example, the categories are home, townhouse, and condo. I have transactional data like:
set.seed(1234)
d <- data.table(purch_id = 1:3e6,
purch_cat = sample(x = c('home','townhouse','condo'),
size = 3e6, replace=TRUE),
purch_zip = formatC( sample(x = 1e4:9e4, size = 3e6, replace=TRUE),
width = 5, format = "d", flag = "0") )
I know that I can do this:
# there has to be a better way...
d[,list(purch_count = length(purch_id)),
by=list(purch_cat, purch_zip)][, purch_rank := rank(-purch_count, ties.method='min'),
by=purch_cat][purch_rank<=3,][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
But this can't be the most elegant data.table
approach, and it seems kind of slow.
Any suggestions to reduce the number of passes? Maybe something using table()
? TYVM!
One approach would be
d[ , .N, by=.(purch_cat, purch_zip)][
order(-N),
.SD[ N >= unique(N)[3] ]
,by=purch_cat]
which gives
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: townhouse 83099 28
5: townhouse 33518 28
6: townhouse 59347 28
7: townhouse 22402 28
8: condo 39169 32
9: condo 15725 31
10: condo 75768 30
11: condo 72023 30
12: home 71294 30
13: home 56053 30
14: home 57971 29
15: home 77521 29
16: home 70124 29
17: home 25302 29
18: home 65292 29
19: home 39488 29
20: home 81754 28
21: home 43426 28
22: home 16943 28
23: home 88978 28
24: home 43003 28
25: home 76501 28
purch_cat purch_zip N
To achieve the OP's tie-breaking rule, one could do
d[ , .N, by=.(purch_cat,purch_zip)][
order(-N),
.SD[ N >= unique(N)[3] ][
.N - frank(N, ties.method='max') < 3 ]
, by=purch_cat]
which gives
purch_cat purch_zip N
1: townhouse 39587 33
2: townhouse 80365 30
3: townhouse 37360 30
4: condo 39169 32
5: condo 15725 31
6: condo 75768 30
7: condo 72023 30
8: home 71294 30
9: home 56053 30
10: home 57971 29
11: home 77521 29
12: home 70124 29
13: home 25302 29
14: home 65292 29
15: home 39488 29
Following @MichaelChirico's answer, this approach adds an frank
step.
I think you were exactly on the right track. One key thing you're missing, however, is the function frank
, which has been optimized and should speed up your code considerably (runs almost instantaneously on your 3m row sample data):
d[ , .(purch_count = .N),
by = .(purch_cat, purch_zip)
][, purch_rank := frank(-purch_count, ties.method = 'min'),
keyby = purch_cat
][purch_rank <= 3,
][order(purch_cat, purch_rank)]
purch_cat purch_zip purch_count purch_rank
1: condo 39169 32 1
2: condo 15725 31 2
3: condo 75768 30 3
4: condo 72023 30 3
5: home 71294 30 1
6: home 56053 30 1
7: home 57971 29 3
8: home 77521 29 3
9: home 70124 29 3
10: home 25302 29 3
11: home 65292 29 3
12: home 39488 29 3
13: townhouse 39587 33 1
14: townhouse 80365 30 2
15: townhouse 37360 30 2
table
(slow):Yes, one way involves using table
.
d[ , {x <- table(purch_zip)
x <- x[order(-x)]
names(x[x %in% unique(x)[1:3]])
}, keyby = purch_cat]
purch_cat V1
1: condo 39169
2: condo 15725
3: condo 72023
4: condo 75768
5: home 56053
6: home 71294
7: home 25302
8: home 39488
9: home 57971
10: home 65292
11: home 70124
12: home 77521
13: home 16943
14: home 43003
15: home 43426
16: home 76501
17: home 81754
18: home 88978
19: townhouse 39587
20: townhouse 37360
21: townhouse 80365
22: townhouse 22402
23: townhouse 33518
24: townhouse 59347
25: townhouse 83099
purch_cat V1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With