
Subset R data frame contingent on the value of duplicate variables

How can I subset the following example data frame to return only one observation for the earliest occurrence [i.e. min(year)] of each id?

id <- c("A", "A", "C", "D", "E", "F")
year <- c(2000, 2001, 2001, 2002, 2003, 2004)
qty  <- c(100, 300, 100, 200, 100, 500)
df=data.frame(year, qty, id)

In the example above there are two observations for the "A" id, at years 2000 and 2001. For duplicate ids, I would like the subset data frame to include only the first occurrence (i.e. the one at 2000).

df2 = subset(df, ???)

This is what I am trying to return:

df2

year qty id
2000 100  A
2001 100  C
2002 200  D
2003 100  E
2004 500  F

Any assistance would be greatly appreciated.

MikeTP asked Jun 26 '12 22:06


3 Answers

You can aggregate on minimum year + id, then merge with the original data frame to get qty:

# df is the data frame defined in the question
df2 <- merge(aggregate(year ~ id, df, min), df)

# > df2
#   id year qty
# 1  A 2000 100
# 2  C 2001 100
# 3  D 2002 200
# 4  E 2003 100
# 5  F 2004 500
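
To see what each step contributes, the one-liner can be unpacked into two statements. This is only a minimal sketch: it rebuilds the question's data frame as `df`, and the intermediate name `first_year` is an assumption introduced for illustration.

```r
# Rebuild the example data from the question
id   <- c("A", "A", "C", "D", "E", "F")
year <- c(2000, 2001, 2001, 2002, 2003, 2004)
qty  <- c(100, 300, 100, 200, 100, 500)
df   <- data.frame(year, qty, id)

# Step 1: earliest year per id (one row per id, columns id and year).
# first_year is a hypothetical name used only in this sketch.
first_year <- aggregate(year ~ id, data = df, FUN = min)

# Step 2: merge on the shared columns (id and year) to recover qty
df2 <- merge(first_year, df)
```

Because the merge matches on both id and year, an id with two rows in its earliest year would keep both of them.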
neilfws answered Nov 17 '22 15:11


Is this what you're looking for? Your second row looks wrong to me (it's the duplicated year, not the first).

> duplicated(df$year)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
> df[!duplicated(df$year), ]
  year qty id
1 2000 100  A
2 2001 300  A
4 2002 200  D
5 2003 100  E
6 2004 500  F

Edit 1: Er, I completely misunderstood what you were asking for. I'll keep this here for completeness though.

Edit 2:

Ok, here's a solution: Sort by year (so the first entry per ID has the earliest year) and then use duplicated. I think this is the simplest solution:

> df.sort.year <- df[order(df$year), ]
> df.sort.year[!duplicated(df.sort.year$id), ]
  year qty id
1 2000 100  A
3 2001 100  C
4 2002 200  D
5 2003 100  E
6 2004 500  F
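
The duplicated() test needs to be computed on the sorted data frame, so that the logical vector lines up with the rows being subset. The example data happens to be sorted by year already, which hides the difference. A minimal sketch with the two "A" rows deliberately swapped (that reordering is an assumption for illustration, not part of the question):

```r
# Question's data, but with the two "A" rows swapped so the input is unsorted
df <- data.frame(
  year = c(2001, 2000, 2001, 2002, 2003, 2004),
  qty  = c(300, 100, 100, 200, 100, 500),
  id   = c("A", "A", "C", "D", "E", "F")
)

# Sort by year, then keep the first row seen for each id in the sorted frame
df.sort.year <- df[order(df$year), ]
df2 <- df.sort.year[!duplicated(df.sort.year$id), ]
```

Here the row kept for "A" is the one from 2000, even though it came second in the input.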
Vince answered Nov 17 '22 13:11


Using plyr

library(plyr)
## make sure first row will be min (year)
df <- arrange(df, id, year)
df2 <- ddply(df, .(id), head, n = 1)


df2
##   year qty id
## 1 2000 100  A
## 2 2001 100  C
## 3 2002 200  D
## 4 2003 100  E
## 5 2004 500  F
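
If plyr is not available, the same split-apply-combine idea can be sketched in base R. This is a rough equivalent rather than a description of how ddply works internally; the names `sorted`, `pieces` and `firsts` are assumptions introduced for this sketch.

```r
# Rebuild the example data from the question
df <- data.frame(
  year = c(2000, 2001, 2001, 2002, 2003, 2004),
  qty  = c(100, 300, 100, 200, 100, 500),
  id   = c("A", "A", "C", "D", "E", "F")
)

# Sort so that the first row within each id has the minimum year
sorted <- df[order(df$id, df$year), ]

# Split by id, take the first row of each piece, and bind the pieces back together
pieces <- split(sorted, sorted$id)
firsts <- lapply(pieces, head, n = 1)
df2 <- do.call(rbind, firsts)
rownames(df2) <- NULL  # drop the id-based row names that rbind adds
```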

Or use data.table. Setting the key to id, year sorts the table by those columns, so the first row within each id has the minimum year.

library(data.table)
DF <- data.table(df, key = c('id','year'))
DF[,.SD[1], by = 'id']

##      id year qty
## [1,]  A 2000 100
## [2,]  C 2001 100
## [3,]  D 2002 200
## [4,]  E 2003 100
## [5,]  F 2004 500
mnel answered Nov 17 '22 13:11