
Keep only groups of data with multiple observations

Tags: r, dplyr

I am trying to keep only the deids that have more than one observation.

I have the data below:

help <- data.frame(deid = c(1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
                   session.number = c(1, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
                   days.since.last = c(0, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4))

   deid session.number days.since.last
1     1              1               0
2     5              1               0
3     5              2               7
4     5              3              14
5     5              4              93
6     5              5               5
7     5              6             102
8    12              1               0
9    12              2              21
10   12              3             104
11   12              4               4

My attempt was to use group_by() followed by filter():

help %>% group_by(deid) %>% filter(session.number >=2)

However, this keeps only the rows where session.number is 2 or greater. That does get rid of deid = 1, but every remaining deid now starts at session.number 2 instead of session.number 1.

What I am trying to tell R is to keep every row of the groups (deid) that have more than one observation (session.number).

Any assistance is greatly appreciated.

b222 asked Jun 10 '15 00:06


2 Answers

This should do it. You need to filter on the number of observations in each group, which n() returns:

library(dplyr)
help %>% group_by(deid) %>% filter(n() > 1)

   deid session.number days.since.last
1     5              1               0
2     5              2               7
3     5              3              14
4     5              4              93
5     5              5               5
6     5              6             102
7    12              1               0
8    12              2              21
9    12              3             104
10   12              4               4
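
To make the difference between a row-level and a group-level condition concrete, here is a minimal sketch. It assumes dplyr is installed and rebuilds the question's data under the hypothetical name sessions (so it does not shadow help()); count() is used only to display the group sizes that n() sees inside filter().

```r
library(dplyr)

# Same data as in the question; `sessions` is a name chosen for this sketch
sessions <- data.frame(
  deid = c(1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
  session.number = c(1, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
  days.since.last = c(0, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4)
)

# One row per deid, with the number of rows in that group in column n
group_sizes <- sessions %>% count(deid)

# Row-level condition: tests each row, so session 1 is dropped from every group
row_filtered <- sessions %>% filter(session.number >= 2)

# Group-level condition: n() is the size of the current group,
# so a group is kept or dropped as a whole
kept <- sessions %>%
  group_by(deid) %>%
  filter(n() > 1) %>%
  ungroup()
```

Calling ungroup() at the end is optional; it just returns a plain (ungrouped) result so later steps do not keep operating per deid by accident.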
jalapic answered Oct 24 '22 17:10


Using data.table instead:

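library(data.table)
help <- as.data.table(help)  # the syntax below needs a data.table, not a data.frame
helpcount <- help[, list(Count = .N), by = deid]  # number of rows per deid
helpf <- merge(help, helpcount, by = "deid")
helpf <- helpf[Count > 1]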

EDIT: A bit more concise:

help[, Count := .N, by = deid]  # adds Count by reference; help must be a data.table
help[Count > 1]

EDIT2: thelatemail's even more concise solution:

help[,if(.N > 1) .SD, by=deid]
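
The one-liner above works because, inside a grouped data.table call, .N is the number of rows in the current group and .SD holds that group's rows; a group whose expression returns NULL contributes nothing to the result. The following is a minimal self-contained sketch of that idea, assuming data.table is installed; the name sessions is hypothetical, and the .I variant is shown only as a cross-check, not as part of the original answer.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
sessions <- data.table(
  deid = c(1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
  session.number = c(1, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4),
  days.since.last = c(0, 0, 7, 14, 93, 5, 102, 0, 21, 104, 4)
)

# For each deid: return the group's rows (.SD) if it has more than one row,
# otherwise return NULL, which drops the group from the result
kept <- sessions[, if (.N > 1) .SD, by = deid]

# Cross-check: collect the row numbers (.I) of groups with more than one row,
# then subset the original table by those row numbers
keep_rows <- sessions[, .I[.N > 1], by = deid]$V1
kept_idx <- sessions[keep_rows]
```

Unlike the := version, neither form adds a Count column to the original table.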
Chris answered Oct 24 '22 18:10