
Find consecutive sequence of zeros in R

Tags:

r

data.table

I have a very large data.frame (actually a data.table). To simplify things, let's assume my data.frame is as follows:

x <- c(1, 1, 0, 0, 1, 0, 0, NA, NA, 0)
y <- c(1, 0, NA, NA, 0, 0, 0, 1, 1, 0)
mydf <- data.frame(rbind(x, y))

I'd like to identify the rows (if any) whose last run of values, ignoring NAs, consists of at least three consecutive zeros. In the example above, the first row ends with three consecutive zeros, but the second one does not.

I know how to do that when I only have a vector (not a data.frame):

runs <- rle(x[!is.na(x)])
n <- length(runs$lengths)
runs$lengths[n] > 2 && runs$values[n] == 0

I could obviously write a loop to get what I want, but it would be very inefficient and my actual data.frame is quite big. Any ideas on how to do this faster?

I guess apply could be useful, but I can't think of how to use it right now. Also, maybe there is a data.table way of doing this?

P.S.: This data.frame is actually a reshaped version of my original data.table. If the job can be done with the data in its original format, that's fine too. My original data looks like this:

x <- c(1, 1, 0, 0, 1, 0, 0, 0)
y <- c(1, 0, 0, 0, 0, 1, 1, 0)

myOriginalDf <- data.frame(value = c(x, y), id = rep(c('x', 'y'), c(length(x), length(y))))
Manoel Galdino asked Mar 01 '13 04:03



4 Answers

A base R solution built on rle: repeat each run length once per element of its run, so that every row carries the length of the run it belongs to:

# Note: rle() is applied to the whole column here, not separately per id
rle_lens <- rle(myOriginalDf$value)$lengths
myOriginalDf$rle_len <- rep(rle_lens, rle_lens)

Then you can subset the rows where value == 0 & rle_len >= 3 (optionally keeping the row numbers in a new column).

> myOriginalDf
   value id rle_len
1      1  x       2
2      1  x       2
3      0  x       2
4      0  x       2
5      1  x       1
6      0  x       3
7      0  x       3
8      0  x       3
9      1  y       1
10     0  y       4
11     0  y       4
12     0  y       4
13     0  y       4
14     1  y       2
15     1  y       2
16     0  y       1
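Because rle is applied to the whole value column above, a run could in principle continue from one id into the next. A minimal sketch of a per-id variant, assuming base R's ave() for the grouping (the names run_len and zero_runs are only illustrative, not part of the original answer):

```r
x <- c(1, 1, 0, 0, 1, 0, 0, 0)
y <- c(1, 0, 0, 0, 0, 1, 1, 0)
myOriginalDf <- data.frame(value = c(x, y),
                           id = rep(c("x", "y"), c(length(x), length(y))))

# Length of the run each element belongs to (illustrative helper)
run_len <- function(v) {
  r <- rle(v)
  rep(r$lengths, r$lengths)
}

# ave() applies run_len separately within each id and keeps the row order
myOriginalDf$rle_len <- ave(myOriginalDf$value, myOriginalDf$id, FUN = run_len)

# Rows that are part of a run of at least three zeros
zero_runs <- myOriginalDf[myOriginalDf$value == 0 & myOriginalDf$rle_len >= 3, ]
print(zero_runs)
```

On this example the result is the same as with the ungrouped version, since no run crosses the boundary between x and y.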
qwr answered Nov 03 '22 02:11



Using data.table, as your question suggests you actually want to. As far as I can see, this does what you want:

library(data.table)
DT <- data.table(myOriginalDf)

# add the original order, so you can't lose it
DT[, orig := .I]

# rle by id, saving the run length as a new variable

DT[, rleLength := {rr <- rle(value); rep(rr$lengths, rr$lengths)}, by = 'id']

# key by value and length to subset 

setkey(DT, value, rleLength)

# rows where value == 0 and run length > 2

DT[list(0, unique(rleLength[rleLength > 2])), nomatch = 0]

##    value rleLength id orig
## 1:     0         3  x    6
## 2:     0         3  x    7
## 3:     0         3  x    8
## 4:     0         4  y   10
## 5:     0         4  y   11
## 6:     0         4  y   12
## 7:     0         4  y   13
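The output above lists every row that sits inside a long run of zeros. To get the yes/no answer per id that the question asks for, the vector check from the question can be moved into j. This is only a sketch under that assumption; the names lastRun and endsWithZeros are made up for illustration:

```r
library(data.table)

x <- c(1, 1, 0, 0, 1, 0, 0, 0)
y <- c(1, 0, 0, 0, 0, 1, 1, 0)
DT <- data.table(value = c(x, y),
                 id = rep(c("x", "y"), c(length(x), length(y))))

# One row per id: TRUE when the last run of non-NA values
# is made of at least three zeros
lastRun <- DT[!is.na(value), {
  rr <- rle(value)
  n <- length(rr$lengths)
  list(endsWithZeros = rr$values[n] == 0 && rr$lengths[n] >= 3)
}, by = id]
print(lastRun)
```

Filtering with !is.na(value) in i drops the NAs before grouping, so each group's rle only sees observed values.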
mnel answered Nov 03 '22 01:11



Here is an apply statement based on your solution for a vector. It might do what you want.

z <- apply(mydf, 1, function(x) {
  runs <- rle(x[!is.na(x)])   # drop NAs, then run-length encode the row
  n <- length(runs$lengths)   # index of the last run
  runs$lengths[n] > 2 & runs$values[n] == 0
})

mydf[z,]

#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# x  1  1  0  0  1  0  0 NA NA   0
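The same check can also be run on long-format data without reshaping, by splitting the values by id. A minimal base R sketch; the helper name ends_with_zeros, its min_len argument, and the data frame name long are assumptions made for this illustration:

```r
x <- c(1, 1, 0, 0, 1, 0, 0, NA, NA, 0)
y <- c(1, 0, NA, NA, 0, 0, 0, 1, 1, 0)
long <- data.frame(value = c(x, y), id = rep(c("x", "y"), each = 10))

# TRUE when the last run of non-NA values is at least min_len zeros
ends_with_zeros <- function(v, min_len = 3) {
  v <- v[!is.na(v)]
  if (length(v) == 0) return(FALSE)  # nothing left after dropping NAs
  runs <- rle(v)
  n <- length(runs$lengths)
  runs$lengths[n] >= min_len && runs$values[n] == 0
}

res <- sapply(split(long$value, long$id), ends_with_zeros)
print(res)
```

The guard for an all-NA group is an addition; the original vector code would index an empty rle result in that case.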
Mark Miller answered Nov 03 '22 01:11



isMidPoint below will flag the middle 0 of any run of three consecutive zeros, if there is one.

library(data.table)
myOriginalDf <- data.table(myOriginalDf, key="id")

myOriginalDf[, isMidPoint := FALSE]
myOriginalDf <- myOriginalDf[!is.na(value)][
  c(FALSE, !value[-(1:2)], FALSE) &      # next element is 0
  c(!value[-length(value)], FALSE) &     # current element is 0
  c(FALSE, !value[-length(value)]),      # previous element is 0
  isMidPoint := TRUE, by=id]

Explanation:

To find a series of three in a row, you simply need to compare each element from the 2nd to the 2nd-to-last with its neighbor before it and after it.

Since your values are 0 / 1, they are effectively T / F, and this makes it extremely simple to evaluate (assuming there were no NAs).

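If v holds your values (without NAs), then !v & !v[-1] will be TRUE wherever an element and its successor are both 0. Add in & !v[-(1:2)] and this will be TRUE wherever a series of three 0s begins. The shifted vectors are shorter than v, so the code above pads them with FALSE; the padding keeps the lengths equal and moves the flag onto the middle element. Notice that this also catches a series of 4+ 0s!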

Then all that remains is to (1) calculate the above while removing (and accounting for!) any NAs, and (2) separate by id value. Fortunately, data.table makes both of these a breeze.
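The neighbor comparison described above can be sketched on a single NA-free vector in base R. The variable names (is_zero, prev_zero, next_zero, is_mid) are illustrative and not taken from the answer's code:

```r
v <- c(1, 1, 0, 0, 1, 0, 0, 0)
n <- length(v)

is_zero   <- v == 0
prev_zero <- c(FALSE, is_zero[-n])   # element i holds is_zero[i - 1]
next_zero <- c(is_zero[-1], FALSE)   # element i holds is_zero[i + 1]

# TRUE only where an element and both of its neighbors are 0
is_mid <- prev_zero & is_zero & next_zero
which(is_mid)
# → 7
```

Padding with FALSE at the ends means the first and last elements can never be flagged, which is the intended behavior since they have only one neighbor.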

Results:

  > myOriginalDf

    row value id isMidPoint
 1:   1     1  x      FALSE
 2:   2     1  x      FALSE
 3:   3     0  x      FALSE
 4:   4     0  x      FALSE
 5:   5     1  x      FALSE
 6:   6     0  x      FALSE
 7:   7     0  x       TRUE  <~~~~
 8:   9     0  x      FALSE
 9:  10     1  x      FALSE
10:  11     0  x      FALSE
11:  12     0  x       TRUE  <~~~~
12:  13     0  x       TRUE  <~~~~
13:  14     0  x       TRUE  <~~~~
14:  15     0  x      FALSE
15:  16     1  y      FALSE
16:  17     0  y      FALSE
17:  18     0  y       TRUE  <~~~~
18:  20     0  y      FALSE
19:  21     1  y      FALSE
20:  22     1  y      FALSE
21:  23     0  y      FALSE
22:  25     0  y       TRUE  <~~~~
23:  27     0  y       TRUE  <~~~~
24:  29     0  y      FALSE
    row value id isMidPoint

EDIT AS PER COMMENTS:

If you want to find the position of the last 0-0-0 sequence, use:

    max(which(myOriginalDf$isMidPoint))

If you want to know whether the last possible sequence is 0-0-0, use:

  # Will be TRUE if last possible sequence is 0-0-0
  #   Note, this accounts for NA's as well
  myOriginalDf[!is.na(value), isMidPoint[length(isMidPoint) - 1]]
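Checking the second-to-last isMidPoint value amounts to asking whether the last three non-NA values are all 0. A minimal base R sketch of that idea for a single vector, assuming a hypothetical helper named ends_in_three_zeros:

```r
x <- c(1, 1, 0, 0, 1, 0, 0, NA, NA, 0)
y <- c(1, 0, NA, NA, 0, 0, 0, 1, 1, 0)

# TRUE when, after dropping NAs, the final three values are all 0
ends_in_three_zeros <- function(v) {
  last3 <- utils::tail(v[!is.na(v)], 3)
  length(last3) == 3 && all(last3 == 0)
}

ends_in_three_zeros(x)
ends_in_three_zeros(y)
```

The length check makes the helper return FALSE for vectors with fewer than three non-NA values instead of treating a short all-zero tail as a match.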
Ricardo Saporta answered Nov 03 '22 01:11
