I have a really big data.frame (actually a data.table). To simplify things, let's assume my data.frame is just as follows:
x <- c(1, 1, 0, 0, 1, 0, 0, NA, NA, 0)
y <- c(1 ,0 ,NA, NA, 0, 0, 0, 1, 1, 0)
mydf <- data.frame(rbind(x,y))
I'd like to identify the rows (if any) in which the last sequence consists of three consecutive zeros, not counting NAs. In the example above, the first row ends with three consecutive zeros, but the second one does not.
I know how to do that if I only have a vector (not a data.frame):
runs <- rle(x[is.na(x)==F])
runs$lengths[length(runs$lengths)] > 2 & runs$values[length(runs$lengths)]==0
I could obviously write a loop and get what I want, but it would be incredibly inefficient and my actual data.frame is quite big. So, any ideas on how to do this faster?
I guess apply could be useful, but I can't think of how to use it right now. Also, maybe there is a data.table way of doing this?
P.S.: This data.frame is actually a reshaped version of my original data.table. If the job can be done with the data in its original format, that's fine too. To see what my data.frame originally looks like, just think of it as:
x <- c(1, 1, 0, 0, 1, 0, 0, 0)
y <- c(1 ,0 , 0, 0, 0, 1, 1, 0)
myOriginalDf <- data.frame(value=c(x,y), id=rep(c('x','y'), c(length(x), length(y))))
Base R solution based on rle, which repeats each length count that many times:
rle_lens <- rle(myOriginalDf$value)$lengths
myOriginalDf$rle_len <- rep(rle_lens, times = rle_lens)
Then you can subset the rows in which value == 0 & rle_len >= 3 (optionally keeping the row numbers as a new column).
> myOriginalDf
   value id rle_len
1      1  x       2
2      1  x       2
3      0  x       2
4      0  x       2
5      1  x       1
6      0  x       3
7      0  x       3
8      0  x       3
9      1  y       1
10     0  y       4
11     0  y       4
12     0  y       4
13     0  y       4
14     1  y       2
15     1  y       2
16     0  y       1
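Note that the run lengths above are computed over the whole value column, so in other data a run could straddle two ids. A minimal base R sketch of a per-id variant, answering the original question directly (is the last run of each id made of at least three zeros, ignoring NAs?), might look like the following. The helper name last_run_is_zeros and its min_len argument are made up for illustration, and the sample data is rebuilt so the snippet runs on its own.

```r
# Hypothetical helper (not from the original answer): TRUE when the last run
# of v, after dropping NAs, consists of at least min_len zeros.
last_run_is_zeros <- function(v, min_len = 3) {
  runs <- rle(v[!is.na(v)])
  n <- length(runs$lengths)
  if (n == 0) return(FALSE)          # v was empty or contained only NAs
  runs$values[n] == 0 && runs$lengths[n] >= min_len
}

x <- c(1, 1, 0, 0, 1, 0, 0, 0)
y <- c(1, 0, 0, 0, 0, 1, 1, 0)
myOriginalDf <- data.frame(value = c(x, y),
                           id = rep(c("x", "y"), c(length(x), length(y))))

# One logical per id: TRUE for x, FALSE for y
res <- tapply(myOriginalDf$value, myOriginalDf$id, last_run_is_zeros)
print(res)
```

Because tapply splits value by id before rle is called, a run can never cross from one id into the next.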
Using data.table, as your question suggests you actually want to. As far as I can see, this does what you want:
library(data.table)
DT <- data.table(myOriginalDf)
# add the original row order, so you can't lose it
DT[, orig := .I]
# rle by id, saving the run length as a new variable
DT[, rleLength := {rr <- rle(value); rep(rr$lengths, rr$lengths)}, by = 'id']
# key by value and length to subset
setkey(DT, value, rleLength)
# which rows have value = 0 and run length > 2
DT[list(0, unique(rleLength[rleLength>2])),nomatch=0]
##    value rleLength id orig
## 1:     0         3  x    6
## 2:     0         3  x    7
## 3:     0         3  x    8
## 4:     0         4  y   10
## 5:     0         4  y   11
## 6:     0         4  y   12
## 7:     0         4  y   13
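The result above lists every row that belongs to a run of three or more zeros. If the goal is instead the one asked about in the question (one flag per id saying whether its last run, ignoring NAs, is three or more zeros), a sketch along the following lines may work. The column name lastRunZeros is invented here, and the sample data is rebuilt so the snippet runs on its own.

```r
library(data.table)

x <- c(1, 1, 0, 0, 1, 0, 0, 0)
y <- c(1, 0, 0, 0, 0, 1, 1, 0)
DT <- data.table(value = c(x, y),
                 id = rep(c("x", "y"), c(length(x), length(y))))

# For each id: run-length encode the non-NA values and inspect the last run.
# lastRunZeros is a hypothetical column name, not part of the original answer.
res <- DT[!is.na(value), {
  rr <- rle(value)
  n <- length(rr$lengths)
  .(lastRunZeros = rr$values[n] == 0 && rr$lengths[n] >= 3)
}, by = id]
print(res)
```

Filtering the NAs in i means every group that reaches j has at least one value, so the last run always exists.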
Here is an apply statement based on your solution for a vector. It might do what you want.
z <- apply(mydf, 1, function(x) {
  runs <- rle(x[!is.na(x)])
  n <- length(runs$lengths)
  runs$lengths[n] > 2 & runs$values[n] == 0
})
mydf[z,]
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
# x 1 1 0 0 1 0 0 NA NA 0
isMidPoint below will identify the middle 0, if there is one.
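library(data.table)
myOriginalDf <- data.table(myOriginalDf, key="id")
# drop NAs first, so that zeros separated only by NAs count as consecutive
myOriginalDf <- myOriginalDf[!is.na(value)]
# evaluate in j so that by=id applies (an i expression is evaluated on the whole table)
# assumes every id has at least 3 non-NA values
myOriginalDf[, isMidPoint := c(FALSE, !value[-(1:2)], FALSE) & c(!value[-.N], FALSE) & c(FALSE, !value[-.N]), by=id]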
To find a series of three in a row, you simply need to compare each element, from the 2nd to the 2nd-to-last, with the neighbor before it and the neighbor after it.
Since your values are 0 / 1, they are effectively FALSE / TRUE, and this makes it extremely simple to evaluate (assuming there are no NAs).
If v are your values (without NAs), then !v & !v[-1] will be TRUE wherever an element and its successor are both 0. Add in & !v[-(1:2)] and this will be TRUE wherever a series of three 0s starts; padding the shifted vectors with FALSE, as in the code above, lines the flag up with the middle of the series instead.
Notice that this also catches series of four or more 0s!
Then all that remains is to (1) calculate the above while removing (and accounting for!) any NAs, and (2) separate by id value. Fortunately, data.table makes both of these a breeze.
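As a rough illustration of the neighbor comparison described above, here is a small base R sketch on a plain vector. The name is_mid_zero is made up for this example; the shifted copies are padded with FALSE so that all three vectors have the same length as v and no recycling takes place.

```r
# Hypothetical helper: flags the middle element of every 0-0-0 triple in v.
# Assumes v contains only 0/1 values and no NAs.
is_mid_zero <- function(v) {
  n <- length(v)
  if (n < 3) return(rep(FALSE, n))      # no room for a triple
  is_zero   <- !v                       # 0 -> TRUE, 1 -> FALSE
  prev_zero <- c(FALSE, is_zero[-n])    # is the element before a 0?
  next_zero <- c(is_zero[-1], FALSE)    # is the element after a 0?
  is_zero & prev_zero & next_zero
}

v <- c(1, 0, 0, 0, 0, 1, 1, 0)
mid <- is_mid_zero(v)
which(mid)
# 3 4
```

The run of four zeros in positions 2 to 5 produces two midpoints (positions 3 and 4), which is the "also catches four or more" behavior mentioned above.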
> myOriginalDf
    row value id isMidPoint
 1:   1     1  x      FALSE
 2:   2     1  x      FALSE
 3:   3     0  x      FALSE
 4:   4     0  x      FALSE
 5:   5     1  x      FALSE
 6:   6     0  x      FALSE
 7:   7     0  x       TRUE  <~~~~
 8:   9     0  x      FALSE
 9:  10     1  x      FALSE
10:  11     0  x      FALSE
11:  12     0  x       TRUE  <~~~~
12:  13     0  x       TRUE  <~~~~
13:  14     0  x       TRUE  <~~~~
14:  15     0  x      FALSE
15:  16     1  y      FALSE
16:  17     0  y      FALSE
17:  18     0  y       TRUE  <~~~~
18:  20     0  y      FALSE
19:  21     1  y      FALSE
20:  22     1  y      FALSE
21:  23     0  y      FALSE
22:  25     0  y       TRUE  <~~~~
23:  27     0  y       TRUE  <~~~~
24:  29     0  y      FALSE
    row value id isMidPoint
If you want to find the last sequence that is TRUE, use:
max(which(myOriginalDf$isMidPoint))
If you want to know whether the last possible sequence is TRUE, use:
# Will be TRUE if the last possible sequence is 0-0-0
# Note: this accounts for NAs as well
myOriginalDf[!is.na(value), isMidPoint[length(isMidPoint)-1]]