Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter data.table by quantity of group

Tags:

r

data.table

Let's say I have a data.table like

sample<-data.table(id=c(1,1,2,2,3,3,3,4,4),
                   name=c("apple","apple","orange","orange",
                          "pear","pear","pear","banana","banana"),
               atr=c("pretty","ugly","bruised","delicious",
                     "pear-shaped","bruised","infested",
                     "too-ripe","perfect"),
               N=c(10,9,15,4,5,7,7,4,12))

I want to return essentially unique(sample[,list(id, name)]) except that I also want the atr column for the value with the greatest N. In cases where there is a tie for highest N then I don't care which of the two is picked but I want only one to be picked.

This almost works merge(sample[,list(N=max(N)),by=list(id,name1)], sample,by=c("id","name1","N")) but since pear has two atr values that tie for max this returns two pears. Besides it not giving the intended result, I also assume/hope there's a way to do this that doesn't involve the join.

like image 504
Dean MacGregor Avatar asked Mar 16 '23 06:03

Dean MacGregor


2 Answers

You could use atr[N == max(N)][1] to return only the first in the case of a tie, like this -

library(data.table)

sample[, .(atr = atr[N == max(N)][1]), by = .(id, name)]
#    id   name     atr
# 1:  1  apple  pretty
# 2:  2 orange bruised
# 3:  3   pear bruised
# 4:  4 banana perfect

Note: As Frank points out atr[N == max(N)][1] is also simply atr[which.max(N)]

like image 193
Rich Scriven Avatar answered Mar 28 '23 03:03

Rich Scriven


I would just use order:

> unique(sample[order(-N), .(id, name, atr)], by = c("id", "name"))
   id   name     atr
1:  2 orange bruised
2:  4 banana perfect
3:  1  apple  pretty
4:  3   pear bruised

If you want to maintain overall sorting, just use order(id, name, -N) instead.

You could also split this to two lines:

setorder(sample, -N) #done by reference, as with all set* functions in data.table
unique(sample[ , .(id, name, atr)], by = c("id", "name"))

Or perhaps better depending on your end goal:

setkey(setorder(sample, -N), id, name)
unique(sample[ , .(id, name, atr)])

(note: order is crucial in this last, as using setorder first will overwrite keys to NULL)

like image 39
MichaelChirico Avatar answered Mar 28 '23 03:03

MichaelChirico