Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

number rows by variable, but start over when condition is hit

I want to number certain combinations of row in a dataframe (which is ordered on ID and on Time)

tc <- textConnection('
id              time       end_yn
abc             10         0
abc             11         0
abc             12         1
abc             13         0
def             10         0
def             15         1
def             16         0
def             17         0
def             18         1
')

test <- read.table(tc, header=TRUE)

The goal is to create a new column ("number") that numbers each row per id from 1 to n until end_yn == 1 is hit. After end_yn == 1, the numbering should start over.

Without taking the end_yn == 1 condition into account the rows can be numbered using:

DT <- data.table(test)
DT[, id := seq_len(.N), by = id]

However the expected outcome should be:

id              time       end_yn   number
abc             10         0        1
abc             11         0        2
abc             12         1        3 
abc             13         0        1 
def             10         0        1
def             15         1        2
def             16         0        1
def             17         0        2
def             18         1        3

How to incorporate the end_yn == 1 condition?

like image 594
Max van der Heijden Avatar asked Oct 19 '12 07:10

Max van der Heijden


1 Answers

I'm guessing there are different ways to do this, but here's one:

DT[, cEnd := c(0,cumsum(end_yn)[-.N])] # carry the end value forward

DT[, number := seq_len(.N), by = "id,cEnd"] # create your sequence

DT[, cEnd := NULL] # remove the column created above

Setting id as the key for DT might be worth while.

like image 56
BenBarnes Avatar answered Sep 29 '22 04:09

BenBarnes