Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping rows on the basis of row differences in R [duplicate]

Tags:

r

grouping

I have a set of animal locations with different sampling intervals. What I want to do is group and seqences where the sampling interval matches a certain criteria (e.g. is below a certain value). Let me illustrate with some dummy data:

start <- Sys.time()
timediff <- c(rep(5,3),20,rep(5,2))
timediff <- cumsum(timediff)

# Set up a dataframe with a couple of time values
df <- data.frame(TimeDate = start + timediff)

# Calculate the time differences between the rows
df$TimeDiff <- c(as.integer(tail(df$TimeDate,-1) - head(df$TimeDate,-1)),NA)

# Define a criteria in order to form groups
df$TimeDiffSmall <- df$TimeDiff <= 5

             TimeDate TimeDiff TimeDiffSmall
1 2016-03-15 23:11:49        5          TRUE
2 2016-03-15 23:11:54        5          TRUE
3 2016-03-15 23:11:59       20         FALSE
4 2016-03-15 23:12:19        5          TRUE
5 2016-03-15 23:12:24        5          TRUE
6 2016-03-15 23:12:29       NA            NA

In this dummy data, rows 1:3 belong to one group, since the time difference between them is <= 5 seconds. 4 - 6 belong to the second group, but hypothetically there could be a number of rows in between the two groups that dont belong to any group (TimeDiffSmall equals to FALSE).

Combining the information from two multiple SO answers (e.g. part 1), I've create a function that solves this problem.

number.groups <- function(input){
  # part 1: numbering successive TRUE values
  input[is.na(input)] <- F
  x.gr <- ifelse(x <- input == TRUE, cumsum(c(head(x, 1), tail(x, -1) - head(x, -1) == 1)),NA)
  # part 2: including last value into group
  items <- which(!is.na(x.gr))
  items.plus <- c(1,items+1)
  sel <- !(items.plus %in% items)
  sel.idx <- items.plus[sel]
  x.gr[sel.idx] <- x.gr[sel.idx-1]
  return(x.gr)


 # Apply the function to create groups
 df$Group <- number.groups(df$TimeDiffSmall)

             TimeDate TimeDiff TimeDiffSmall Group
1 2016-03-15 23:11:49        5          TRUE     1
2 2016-03-15 23:11:54        5          TRUE     1
3 2016-03-15 23:11:59       20         FALSE     1
4 2016-03-15 23:12:19        5          TRUE     2
5 2016-03-15 23:12:24        5          TRUE     2
6 2016-03-15 23:12:29       NA            NA     2

This function actually works to solve my problem. This this is, it seems like a crazy and rookie way to go about this. Is there a function that could solve my problem more professionally?

like image 632
Ratnanil Avatar asked Mar 15 '16 22:03

Ratnanil


1 Answers

Like @thelatemail, I'd use the following to get the group IDs. It works because cumsum() will end up incrementing the group count each time it reaches an element preceded by a greater-than-5-second time interval.

df$Group <- cumsum(c(TRUE, diff(df$TimeDate) > 5))
df$Group
# [1] 1 1 1 2 2 2
like image 157
Josh O'Brien Avatar answered Nov 19 '22 01:11

Josh O'Brien