Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - Keep first observation per group identified by multiple variables (Stata equivalent "bys var1 var2 : keep if _n == 1")

Tags:

r

stata

So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.

Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:

id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)

Should be transformed to:

   id day value   
   1   1    10 
   1   2    15 
   1   3    20 
   2   1    40 
   2   2    30 
   3   2    22 
   3   3    24 
   4   1    11 
   4   2    11 
   4   3    12 

By keeping only one of the rows with one or multiple duplicate group-identificators (here that is only row[1]: (id,day)=(1,1)), sorting for value first (so that the row with the lowest value is kept).

In Stata, this would simply be:

bys id day (value): keep if _n == 1

I found a piece of code on the web, which properly does that if I first produce a single group identifier :

mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="")  ### the single group identifier

myid.uni <- unique(mydata$id1)
a<-length(myid.uni)

last <- c()

for (i in 1:a) {
  temp<-subset(mydata, id1==myid.uni[i])
  if (dim(temp)[1] > 1) {
    last.temp<-temp[dim(temp)[1],]
  }
  else {
    last.temp<-temp
  }
  last<-rbind(last, last.temp)
}

last

However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.

Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?

like image 971
iraserd Avatar asked Feb 04 '14 12:02

iraserd


1 Answers

Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:

library(data.table)
mydata <- data.table(my.data)
mydata <- mydata[, .SD[1], by = .(id, day)]

Using dplyr with magrittr pipes:

library(dplyr)
mydata <- mydata %>%
  group_by(id, day) %>%
  slice(1) %>%
  ungroup()

If you don't add ungroup() to the end dplyr's grouping structure will still be present and might mess up some of your subsequent functions.

like image 187
Feakster Avatar answered Oct 11 '22 14:10

Feakster