I need to identify chunks of rows in a data.table by a start-row and an end-row criterion. In the MWE below, a group starts at a row where colA == "d" and continues until the next row where colA == "a":
library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
in.data$wanted.column <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, NA)
in.data
# colA wanted.column
# 1: b NA
# 2: f NA
# 3: b NA
# 4: k NA
# 5: d 1
# 6: b 1
# 7: a 1
# 8: s NA
# 9: a NA
# 10: n NA
# 11: d 2
# 12: f 2
# 13: d 2
# 14: a 2
# 15: t NA
(It doesn't matter if out-of-group values are NA, zero or any other identifiable result)
The original version of this answer looked for the shortest sequences, which was wrong because a sequence can contain the start symbol in the middle, e.g. c('d','f','d','a'). The edited version fixes this problem.
I was informed that when two sequences follow each other directly (e.g. in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t"))), they were enumerated as one sequence, which is wrong. Here, I fix this by keeping track of the occurrences of symbol.stop in colA.
Setup
library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
symbol.start='d'
symbol.stop='a'
Actual code
in.data[,y := rev(cumsum(rev(colA)==symbol.stop))][,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y]
in.data$out[in.data$out] <- as.factor(max(in.data$y)-in.data$y[in.data$out])
Here, [,y := rev(cumsum(rev(colA)==symbol.stop))] creates a column y that groups the rows by the occurrences of symbol.stop, counted from the end of the table. The [,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y] expression returns a logical vector that tells whether a row belongs to a symbol.start...symbol.stop sequence. The next line is needed to enumerate such sequences.
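To make the two chained steps easier to follow, here is a minimal sketch that keeps the helper column y visible. It works on a copy called demo (a name introduced here for illustration only), so in.data is left untouched.

```r
library(data.table)

symbol.start <- "d"
symbol.stop  <- "a"
# 'demo' is an illustrative copy of the example data, not part of the original answer
demo <- data.table(colA = c("b", "f", "b", "k", "d", "b", "a", "s",
                            "a", "n", "d", "f", "d", "a", "t"))

# y counts the stop symbols at or after each row,
# so every group (except possibly the last) ends on a stop symbol
demo[, y := rev(cumsum(rev(colA) == symbol.stop))]
demo$y
# 3 3 3 3 3 3 3 2 2 1 1 1 1 1 0

# within each y group, flag the rows from the first start symbol onwards;
# `!` binds more loosely than `>`, so the original reads as !(match(...) > 1:.N)
demo[, out := !(match(symbol.start, colA, nomatch = .N + 1) > 1:.N), by = y]
which(demo$out)
# 5 6 7 11 12 13 14
```

Groups without a start symbol fall back to nomatch = .N + 1, which is larger than every row index in the group, so none of their rows is flagged.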
Clean up and output
in.data$y <- NULL
in.data
# colA out
# 1: b 0
# 2: f 0
# 3: b 0
# 4: k 0
# 5: d 1
# 6: b 1
# 7: a 1
# 8: s 0
# 9: a 0
# 10: n 0
# 11: d 2
# 12: f 2
# 13: d 2
# 14: a 2
# 15: t 0
Just in case somebody needs it, here is the whole solution as a single chained expression (rows outside any sequence get NA rather than 0):
in.data[ , y := rev(cumsum(rev(colA)==symbol.stop))
][ , z:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N), by=y
][ z==T, out:=as.numeric(factor(y,levels=unique(y)))
][ , c('z','y'):=list(NULL,NULL)]
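As a quick check of the back-to-back case mentioned earlier, the chained expression can be wrapped in a small function and run on that input. This is only a sketch: label_runs is a hypothetical helper name introduced here, and z == TRUE is used in place of z==T.

```r
library(data.table)

# label_runs() is a hypothetical wrapper around the chained expression above
label_runs <- function(x, symbol.start = "d", symbol.stop = "a") {
  dt <- data.table(colA = x)
  dt[, y := rev(cumsum(rev(colA) == symbol.stop))
   ][, z := !(match(symbol.start, colA, nomatch = .N + 1) > 1:.N), by = y
   ][z == TRUE, out := as.numeric(factor(y, levels = unique(y)))
   ][, c("z", "y") := NULL]
  dt$out
}

# two sequences that follow each other directly
res <- label_runs(c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t"))
res
# NA NA NA NA 1 1 1 2 2 2 2 NA
```

Rows 5 to 7 and rows 8 to 11 get different numbers because they fall into different y groups, even though no other row separates them.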
I am sure someone will come up with a nice data.table solution. While waiting, here's another base possibility:
in.df <- as.data.frame(in.data)
# index of "d", start index
start <- which(in.df$colA == "d")
# index of "a"
idx_a <- which(in.df$colA == "a")
# end index: for each start index, select the first index of "a" which is larger
end <- idx_a[sapply(start, function(x) which.max(x < idx_a))]
# check if runs overlap and create groups of runs
lag_end <- c(0, head(end, -1))
run <- cumsum(start >= lag_end)
df <- data.frame(start, end, run)
# within each run, expand the sequence of idx, from min(start) to max(end)
df2 <- do.call(rbind,
by(df, df$run, function(x){
# run is constant within a group, so take a single value and let it recycle
data.frame(run = x$run[1], idx = min(x$start):max(x$end))
})
)
# add an empty 'run' variable to in.df
in.df$run <- NA
# assign df2$run at idx in in.df
in.df$run[df2$idx] <- df2$run
# idx colA wanted.column run
# 1 1 b NA NA
# 2 2 f NA NA
# 3 3 b NA NA
# 4 4 k NA NA
# 5 5 d 1 1
# 6 6 b 1 1
# 7 7 a 1 1
# 8 8 s NA NA
# 9 9 a NA NA
# 10 10 n NA NA
# 11 11 d 2 2
# 12 12 f 2 2
# 13 13 d 2 2
# 14 14 a 2 2
# 15 15 t NA NA
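The overlap check is the least obvious step of the base approach, so here is a small sketch of the intermediate values on the example data. It uses a plain character vector instead of in.df and assumes that every "d" is followed by an "a" somewhere later; without one, which.max() would silently return 1.

```r
colA <- c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t")

start <- which(colA == "d")
idx_a <- which(colA == "a")
# first "a" after each start (assumes such an "a" exists for every start)
end <- idx_a[sapply(start, function(x) which.max(x < idx_a))]
# end of the previous candidate run; 0 for the first one
lag_end <- c(0, head(end, -1))
# a start opens a new run only if it lies at or beyond the previous end
run <- cumsum(start >= lag_end)

data.frame(start, end, lag_end, run)
#   start end lag_end run
# 1     5   7       0   1
# 2    11  14       7   2
# 3    13  14      14   2
```

The "d" at row 13 lies before the previous end (14), so it does not open a new run and stays in run 2.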