
ID chunks of rows by start and end value

Tags: r, data.table

I need to ID chunks of rows in a data.table by a start-row and an end-row criterion. In the MWE below, a chunk starts at a row where colA=="d" and continues up to and including the next row where colA=="a".

library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
in.data$wanted.column <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, NA)

in.data
#     colA wanted.column
#  1:    b            NA
#  2:    f            NA
#  3:    b            NA
#  4:    k            NA
#  5:    d             1
#  6:    b             1
#  7:    a             1
#  8:    s            NA
#  9:    a            NA
# 10:    n            NA
# 11:    d             2
# 12:    f             2
# 13:    d             2
# 14:    a             2
# 15:    t            NA

(It doesn't matter whether out-of-group values are NA, zero, or any other identifiable value.)

Chris asked Jan 15 '15 15:01


2 Answers

UPDATE

The original version of the answer looked for the shortest sequences, which was not right because a sequence can contain the start symbol in the middle, e.g. c('d','f','d','a'). The edited version of the answer fixes this problem.

UPDATE2

I was informed that when two sequences directly follow each other (e.g. in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t"))), they are enumerated as one sequence, which is wrong. Here, I fix this problem by keeping track of the occurrences of symbol.stop in colA.

Setup

library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
symbol.start <- 'd'
symbol.stop <- 'a'

Actual code

in.data[,y := rev(cumsum(rev(colA)==symbol.stop))][,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y]

in.data$out[in.data$out] <- as.factor(max(in.data$y)-in.data$y[in.data$out])

Here, [,y := rev(cumsum(rev(colA)==symbol.stop))] creates a column y that counts the occurrences of symbol.stop from the end of the table, so every block of rows that ends with symbol.stop shares one value of y. Grouped by y, the expression [,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y] returns a logical vector that is TRUE from the first symbol.start in the block onwards, i.e. for the rows that belong to a symbol.start...symbol.stop sequence. The second line (the assignment to in.data$out) is needed to enumerate such sequences.
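To make the two steps easier to follow, here is a minimal base R sketch that recomputes the intermediate values on a plain character vector. It is an illustration under assumptions, not the answer's code: the name in.seq is made up for this example, and split()/lapply() stand in for data.table's by=y grouping.

```r
colA <- c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t")
symbol.start <- "d"
symbol.stop <- "a"

# Count stop symbols from the end: each block of rows that ends
# with a stop symbol gets its own value of y.
y <- rev(cumsum(rev(colA) == symbol.stop))
y
# 3 3 3 3 3 3 3 2 2 1 1 1 1 1 0

# Within each block, flag the rows at or after the first start symbol.
# (in.seq is a hypothetical name; levels = unique(y) keeps the blocks in row order.)
blocks <- split(colA, factor(y, levels = unique(y)))
in.seq <- unlist(lapply(blocks, function(block) {
  first.start <- match(symbol.start, block, nomatch = length(block) + 1)
  !(first.start > seq_along(block))
}), use.names = FALSE)
which(in.seq)
# 5 6 7 11 12 13 14
```

Rows 8 and 9 form a block that ends in "a" but contains no "d", so match() falls back to nomatch and none of its rows is flagged.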

Clean up and output

in.data$y <- NULL   

in.data
#     colA out
#  1:    b   0
#  2:    f   0
#  3:    b   0
#  4:    k   0
#  5:    d   1
#  6:    b   1
#  7:    a   1
#  8:    s   0
#  9:    a   0
# 10:    n   0
# 11:    d   2
# 12:    f   2
# 13:    d   2
# 14:    a   2
# 15:    t   0

UPDATE3

Just in case somebody needs it, the same solution as a single chained expression:

in.data[     , y := rev(cumsum(rev(colA)==symbol.stop))
      ][     , z:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N), by=y
      ][ z==T, out:=as.numeric(factor(y,levels=unique(y)))
      ][     , c('z','y'):=list(NULL,NULL)]
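The back-to-back case from UPDATE2 can be checked with a small base R restatement of the same steps. This is only a sketch: the function name id.chunks and its arguments are hypothetical, and it returns NA (rather than 0) for rows outside any sequence. The as.integer(factor(y, levels = unique(y))) step mirrors the renumbering in the chain above, turning the descending y values into 1, 2, ... in order of appearance.

```r
# Hypothetical helper: number the start...stop sequences in a character vector.
id.chunks <- function(x, start, stop) {
  # Blocks of rows, each ending with a stop symbol (counted from the end).
  y <- rev(cumsum(rev(x) == stop))
  blocks <- split(x, factor(y, levels = unique(y)))
  # TRUE from the first start symbol of a block onwards.
  in.seq <- unlist(lapply(blocks, function(b) {
    !(match(start, b, nomatch = length(b) + 1) > seq_along(b))
  }), use.names = FALSE)
  # Renumber the flagged blocks 1, 2, ... in order of appearance.
  out <- rep(NA_integer_, length(x))
  out[in.seq] <- as.integer(factor(y[in.seq], levels = unique(y[in.seq])))
  out
}

# Two sequences that directly follow each other (the UPDATE2 example).
adjacent <- c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t")
id.chunks(adjacent, "d", "a")
# NA NA NA NA 1 1 1 2 2 2 2 NA
```

Because the blocks are delimited by the stop symbol, the second sequence starts a new block right after row 7 and receives its own number.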
Marat Talipov answered Oct 05 '22 20:10


I am sure someone will come up with a nice data.table solution. While waiting, here's a base R possibility:

in.df <- as.data.frame(in.data)

# index of "d", start index
start <- which(in.df$colA == "d")

# index of "a"
idx_a <- which(in.df$colA == "a")

# end index: for each start index, select the first index of "a" which is larger
end <- idx_a[sapply(start, function(x) which.max(x < idx_a))]

# check if runs overlap and create groups of runs
lag_end <- c(0, head(end, -1))
run <- cumsum(start >= lag_end)

df <- data.frame(start, end, run)

# within each run, expand the sequence of idx, from min(start) to max(end)
df2 <- do.call(rbind,
        by(df, df$run, function(x){
          data.frame(run = x$run[1], idx = min(x$start):max(x$end))
        })
)

# add an empty 'run' variable to in.df
in.df$run <- NA

# assign df2$run at idx in in.df
in.df$run[df2$idx] <- df2$run

#    idx colA wanted.column run
# 1    1    b            NA  NA
# 2    2    f            NA  NA
# 3    3    b            NA  NA
# 4    4    k            NA  NA
# 5    5    d             1   1
# 6    6    b             1   1
# 7    7    a             1   1
# 8    8    s            NA  NA
# 9    9    a            NA  NA
# 10  10    n            NA  NA
# 11  11    d             2   2
# 12  12    f             2   2
# 13  13    d             2   2
# 14  14    a             2   2
# 15  15    t            NA  NA
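For the example data, the intermediate vectors of this approach can be inspected directly. The sketch below repeats the first steps on a plain character vector (colA here is a standalone copy of the column, an assumption made to keep the snippet self-contained):

```r
colA <- c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t")

start <- which(colA == "d")   # 5 11 13
idx_a <- which(colA == "a")   # 7 9 14

# For each start, the first "a" that comes after it.
end <- idx_a[sapply(start, function(x) which.max(x < idx_a))]
end
# 7 14 14

# A start that lies before the previous end belongs to the same run.
lag_end <- c(0, head(end, -1))
run <- cumsum(start >= lag_end)
run
# 1 2 2
```

The "d" in row 13 falls inside the sequence opened in row 11, so it shares run 2. One caveat: which.max() returns 1 when no element is TRUE, so a "d" with no later "a" would be paired with the first "a"; the example data has no such case.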
Henrik answered Oct 05 '22 18:10