
select last observation from longitudinal data

Tags:

r

I have a data set with several time assessments for each participant. I want to select the last assessment for each participant. My dataset looks like this:

ID  week  outcome
1   2   14
1   4   28
1   6   42
4   2   14
4   6   46
4   9   64
4   9   71
4  12   85
9   2   14
9   4   28
9   6   51
9   9   66
9  12   84
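
For reference, a minimal sketch of this data as an R data frame (the answers below refer to it as df or dat):

df <- data.frame(
  ID      = c(1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9),
  week    = c(2, 4, 6, 2, 6, 9, 9, 12, 2, 4, 6, 9, 12),
  outcome = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 66, 84)
)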

I want to select only the last observation/assessment for each participant, but I only have the number of weeks as an indicator for each participant. How can I do this in R (or Excel)?

thanks in advance,

niki

asked Feb 27 '12 by user1236418


3 Answers

Here is one base-R approach:

do.call("rbind", 
        by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week), ]))
  ID week outcome
1  1    6      42
4  4   12      85
9  9   12      84
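
Another base-R route, not from the original answer but a sketch under the same assumption that the last assessment is the row with the largest week, computes each ID's final week with aggregate() and then uses merge() to pull back the matching rows:

last_weeks <- aggregate(week ~ ID, data = df, FUN = max)  # max week per ID
merge(df, last_weeks)  # joins on the shared ID and week columns, keeping only those rows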

Alternatively, the data.table package offers a succinct and expressive language for performing data frame manipulations of this type:

library(data.table)
dt <- data.table(df, key="ID")

dt[, .SD[which.max(outcome), ], by=ID] 
#      ID week outcome
# [1,]  1    6      42
# [2,]  4   12      85
# [3,]  9   12      84

# Same result, but much faster
# (strictly speaking, only the same as long as there are no ties for max(outcome))
dt[ dt[, outcome == max(outcome), by=ID][[2]] ]

# If there are ties for max(outcome), the following will still produce
# the same results as the method using .SD, but will be faster
i1 <- dt[,which.max(outcome), by=ID][[2]]
i2 <- dt[,.N, by=ID][[2]]
dt[i1 + cumsum(i2) - i2,]
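
Another widely used data.table idiom, not from the original answer, computes the row index of each group's maximum with .I and subsets once; the sketch below selects on week (the question's notion of "last") rather than outcome:

# .I holds the row numbers of dt within each group; $V1 extracts them as one vector
dt[dt[, .I[which.max(week)], by = ID]$V1]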

Finally, here is a plyr-based solution:

library(plyr)

ddply(df, .(ID), function(X) X[which.max(X$week), ])
#   ID week outcome
# 1  1    6      42
# 2  4   12      85
# 3  9   12      84
answered by Josh O'Brien


If you're just looking for the last observation per participant ID, a simple two-line base-R solution should do it. I'm always up for a simple base solution when possible, though it's great to have more than one way to solve a problem.

dat <- dat[order(dat$ID, dat$Week), ]        # Sort by ID and week
dat[!duplicated(dat$ID, fromLast = TRUE), ]  # Keep the last observation per ID

   ID Week Outcome
3   1    6      42
8   4   12      85
13  9   12      84
answered by user1234357


I can play this game. I ran some benchmarks on the differences between lapply, sapply, and by, among other things. It appears to me that the more control you have over data types and the more basic the operation, the faster it is (e.g., lapply is generally faster than sapply, and as.numeric(lapply(...)) is faster still). With that in mind, the following produces the same results as above and may be faster than the rest.

df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ]

Explanation: we only want which.max applied to the weeks of each ID, which is what the lapply call handles. We only need the vector of these relative positions, so we make it numeric. The result is the vector (3, 5, 5). We then need to add on the positions of the prior maxes to get row numbers, which is what cumsum accomplishes.
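
To make that concrete, here is a quick sketch of those intermediate values on the example data (relpos is just an illustrative name, and df is assumed to hold the data exactly as listed in the question):

relpos <- as.numeric(lapply(split(df$week, df$ID), which.max))
relpos          # 3 5 5  -- position of the largest week within each ID's block of rows
cumsum(relpos)  # 3 8 13 -- the row numbers of df selected above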

It should be noted that this solution is not general, because of the cumsum step. It requires that the frame be sorted on ID and week before execution; I hope you can see why (and know how to use with(df, order(ID, week)) in the row index to achieve it). Even then, it can still fail when the maximum is not unique, because which.max only returns the first one. So my solution begs the question a bit, but that was true all along: we're extracting very specific information from a very specific example, and our solutions can't be fully general (even though the methods are important to understand generally).
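
Here is a sketch of that sorted variant (df2 is just an illustrative name; it still assumes each ID has a unique maximum week):

df2 <- df[with(df, order(ID, week)), ]  # sort so each ID's largest week sits in its last row
df2[cumsum(as.numeric(lapply(split(df2$week, df2$ID), which.max))), ]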

I'll leave it to trinker to update his comparisons!
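
In the meantime, here is a rough sketch of how such a comparison might be run with the microbenchmark package (assuming it is installed, and that df is sorted by ID and week so all three approaches return the same rows):

library(microbenchmark)
microbenchmark(
  by_rbind   = do.call("rbind", by(df, df$ID, function(DF) DF[which.max(DF$week), ])),
  dedup_last = df[!duplicated(df$ID, fromLast = TRUE), ],
  split_cums = df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ],
  times = 100
)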

answered by Bryan Goodrich