ID chunks of rows by start and end value

Question

I need to ID chunks of rows in a data.table by a start-row and an end-row criterion. In the MWE below, a chunk starts at a row where colA=="d" and continues through the next row where colA=="a".

library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
in.data$wanted.column <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, NA)

in.data
#     colA wanted.column
#  1:    b            NA
#  2:    f            NA
#  3:    b            NA
#  4:    k            NA
#  5:    d             1
#  6:    b             1
#  7:    a             1
#  8:    s            NA
#  9:    a            NA
# 10:    n            NA
# 11:    d             2
# 12:    f             2
# 13:    d             2
# 14:    a             2
# 15:    t            NA

(It doesn't matter if out-of-group values are NA, zero or any other identifiable result)

Marat Talipov · Accepted Answer

UPDATE

The original version of the answer looked for the shortest sequences, which was not right because a sequence can contain the starting symbol in the middle, e.g. c('d','f','d','a'). The edited version of the answer fixes this problem.

UPDATE2

I was informed that when two sequences follow each other directly (e.g. in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t"))), they are enumerated as one sequence, which is wrong. Here, I fix this problem by keeping track of the occurrences of symbol.stop in colA.

Setup

library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
symbol.start='d'
symbol.stop='a'

Actual code

in.data[,y := rev(cumsum(rev(colA)==symbol.stop))][,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y]

in.data$out[in.data$out] <- as.factor(max(in.data$y)-in.data$y[in.data$out])

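Here, [,y := rev(cumsum(rev(colA)==symbol.stop))] creates a column y that counts the symbol.stop occurrences at or below each row, so every stretch of rows ending in a symbol.stop shares one value of y and can be used as a group. Within each group, [,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y] returns a logical vector that is TRUE from the first symbol.start onward, i.e. for the rows that belong to a symbol.start ... symbol.stop sequence. (R parses !a>b as !(a>b), and a group without symbol.start gets nomatch=.N+1, so none of its rows are flagged.) The next line enumerates the sequences by replacing each TRUE with a sequence number; rows outside any sequence end up as 0.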
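To make the two grouping steps easier to follow, here is a minimal sketch that stops before the enumeration and prints the intermediate columns. It uses a copy named dt (a name chosen for this illustration) and writes the condition as 1:.N >= match(...), which is equivalent to the negated comparison in the answer.

```r
library(data.table)

# Same toy input as in the Setup section; `dt` is an illustrative name
dt <- data.table(colA = c("b", "f", "b", "k", "d", "b", "a", "s",
                          "a", "n", "d", "f", "d", "a", "t"))
symbol.start <- "d"
symbol.stop  <- "a"

# Step 1: y counts the stop symbols at or below each row (a cumulative sum
# taken from the bottom), so y is constant on every block ending in a stop.
dt[, y := rev(cumsum(rev(colA) == symbol.stop))]

# Step 2: within each block, flag the rows at or after the first start symbol.
# Blocks without a start symbol get nomatch = .N + 1, so no row is flagged.
dt[, out := 1:.N >= match(symbol.start, colA, nomatch = .N + 1L), by = y]

print(dt)
```

For this input, y is 3 on rows 1-7, 2 on rows 8-9, 1 on rows 10-14 and 0 on row 15, and out is TRUE on rows 5-7 and 11-14.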

Clean up and output

in.data$y <- NULL   

in.data
#     colA out
#  1:    b   0
#  2:    f   0
#  3:    b   0
#  4:    k   0
#  5:    d   1
#  6:    b   1
#  7:    a   1
#  8:    s   0
#  9:    a   0
# 10:    n   0
# 11:    d   2
# 12:    f   2
# 13:    d   2
# 14:    a   2
# 15:    t   0

UPDATE3

Just in case somebody needs it, the same solution as a single chained expression:

in.data[     , y := rev(cumsum(rev(colA)==symbol.stop))
      ][     , z:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N), by=y
      ][ z==TRUE, out:=as.numeric(factor(y,levels=unique(y)))
      ][     , c('z','y'):=list(NULL,NULL)]
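If the labelling is needed in more than one place, the chained expression can be wrapped in a small function. The sketch below is one way to do that; the name id_chunks and the column name sym are hypothetical and not part of the original answer. It is tried on the back-to-back input mentioned in UPDATE2.

```r
library(data.table)

# Hypothetical helper (not from the original answer): returns a numeric
# vector with a chunk number for rows between a start symbol and the next
# stop symbol, and NA for all other rows.
id_chunks <- function(x, symbol.start, symbol.stop) {
  dt <- data.table(sym = x)
  # blocks of rows that end in a stop symbol share one value of y
  dt[, y := rev(cumsum(rev(sym) == symbol.stop))]
  # within each block, flag rows from the first start symbol onward
  dt[, z := 1:.N >= match(symbol.start, sym, nomatch = .N + 1L), by = y]
  # initialise the result so the column exists even if nothing is flagged
  dt[, out := NA_real_]
  # number the flagged blocks in order of appearance
  dt[z == TRUE, out := as.numeric(factor(y, levels = unique(y)))]
  dt$out
}

# Input from UPDATE2: the second sequence starts right after the first ends
back.to.back <- c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t")
res <- id_chunks(back.to.back, "d", "a")
print(res)
```

Here rows 5-7 should come out as chunk 1 and rows 8-11 as chunk 2, with NA elsewhere.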

Henrik · Answer

I am sure someone will come up with a nice data.table solution. While waiting, here's another base possibility:

in.df <- as.data.frame(in.data)

# index of "d", start index
start <- which(in.df$colA == "d")

# index of "a"
idx_a <- which(in.df$colA == "a")

# end index: for each start index, select the first index of "a" which is larger
end <- idx_a[sapply(start, function(x) which.max(x < idx_a))]

# check if runs overlap and create groups of runs
lag_end <- c(0, head(end, -1))
run <- cumsum(start >= lag_end)

df <- data.frame(start, end, run)

# within each run, expand the sequence of idx, from min(start) to max(end)
df2 <- do.call(rbind,
        by(df, df$run, function(x){
          # x$run is constant within a run; take one value so it recycles over idx
          data.frame(run = x$run[1], idx = min(x$start):max(x$end))
        })
)

# add an empty 'run' variable to in.df
in.df$run <- NA

# assign df2$run at idx in in.df
in.df$run[df2$idx] <- df2$run

#    idx colA wanted.column run
# 1    1    b            NA  NA
# 2    2    f            NA  NA
# 3    3    b            NA  NA
# 4    4    k            NA  NA
# 5    5    d             1   1
# 6    6    b             1   1
# 7    7    a             1   1
# 8    8    s            NA  NA
# 9    9    a            NA  NA
# 10  10    n            NA  NA
# 11  11    d             2   2
# 12  12    f             2   2
# 13  13    d             2   2
# 14  14    a             2   2
# 15  15    t            NA  NA
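One edge case in the end-index lookup may be worth guarding against: which.max() returns 1 when no element is TRUE, so a "d" with no later "a" would be paired with the first "a" in the data. The sketch below shows this on a small made-up vector and a guarded alternative; the helper name first_stop_after is hypothetical.

```r
# Hypothetical helper: index of the first stop after each start, NA if none
first_stop_after <- function(start, idx_stop) {
  vapply(start, function(s) {
    later <- idx_stop[idx_stop > s]
    if (length(later) > 0L) later[1L] else NA_integer_
  }, integer(1))
}

# Made-up input for illustration: the last "d" is never closed by an "a"
colA  <- c("d", "b", "a", "s", "d", "f")
start <- which(colA == "d")
idx_a <- which(colA == "a")

# Unguarded lookup: the all-FALSE comparison for row 5 falls back to
# position 1, so the start at row 5 is paired with the stop at row 3
unguarded <- idx_a[sapply(start, function(x) which.max(x < idx_a))]

# Guarded lookup: the unmatched start gets NA instead
guarded <- first_stop_after(start, idx_a)

print(unguarded)
print(guarded)
```

Starts whose end is NA could then be dropped before the runs are built, or handled in whatever way suits the data.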

Tags: r, data.table

Chris
