
Split data.frame based on condition

Tags:

r

Let's say I have the following data.frame, where pos is a position coordinate. I've included a logical variable thresh that is TRUE where val is greater than a given threshold t.

set.seed(123)
n <- 20
t <- 0
DF <- data.frame(pos = seq(from = 0, by = 0.3, length.out = n),
                 val = sample(-2:5, size = n, replace = TRUE))
DF$thresh <- DF$val > t
DF

##    pos val thresh
## 1  0.0   0  FALSE
## 2  0.3   4   TRUE
## 3  0.6   1   TRUE
## 4  0.9   5   TRUE
## 5  1.2   5   TRUE
## 6  1.5  -2  FALSE
## 7  1.8   2   TRUE
## 8  2.1   5   TRUE
## 9  2.4   2   TRUE
## 10 2.7   1   TRUE
## 11 3.0   5   TRUE
## 12 3.3   1   TRUE
## 13 3.6   3   TRUE
## 14 3.9   2   TRUE
## 15 4.2  -2  FALSE
## 16 4.5   5   TRUE
## 17 4.8  -1  FALSE
## 18 5.1  -2  FALSE
## 19 5.4   0  FALSE
## 20 5.7   5   TRUE

How could I get the coordinates of each contiguous region where val is positive? In the above example, that would be:

0.3 - 1.2,
1.8 - 3.9,
4.5 - 4.5,
5.7 - 5.7

I have thought of splitting the data.frame by thresh and then taking pos from the first and last row of each list element, but that just pools all the TRUE rows into one subset and all the FALSE rows into another. Is there a way to turn thresh into a grouping variable that labels each consecutive run of TRUE values separately and discards the FALSE values?

split(DF, DF$thresh) # not what I want


## $`FALSE`
##    pos val thresh
## 1  0.0   0  FALSE
## 6  1.5  -2  FALSE
## 15 4.2  -2  FALSE
## 17 4.8  -1  FALSE
## 18 5.1  -2  FALSE
## 19 5.4   0  FALSE
## 
## $`TRUE`
##    pos val thresh
## 2  0.3   4   TRUE
## 3  0.6   1   TRUE
## 4  0.9   5   TRUE
## 5  1.2   5   TRUE
## 7  1.8   2   TRUE
## 8  2.1   5   TRUE
## 9  2.4   2   TRUE
## 10 2.7   1   TRUE
## 11 3.0   5   TRUE
## 12 3.3   1   TRUE
## 13 3.6   3   TRUE
## 14 3.9   2   TRUE
## 16 4.5   5   TRUE
## 20 5.7   5   TRUE

Another clunky thing I tried was cumsum, which starts a new group at every FALSE row, but it still includes the FALSE rows:

split(DF, cumsum(DF$thresh == 0)) # not what I want but close to it...


## $`1`
##   pos val thresh
## 1 0.0   0  FALSE
## 2 0.3   4   TRUE
## 3 0.6   1   TRUE
## 4 0.9   5   TRUE
## 5 1.2   5   TRUE
## 
## $`2`
##    pos val thresh
## 6  1.5  -2  FALSE
## 7  1.8   2   TRUE
## 8  2.1   5   TRUE
## 9  2.4   2   TRUE
## 10 2.7   1   TRUE
## 11 3.0   5   TRUE
## 12 3.3   1   TRUE
## 13 3.6   3   TRUE
## 14 3.9   2   TRUE
## 
## $`3`
##    pos val thresh
## 15 4.2  -2  FALSE
## 16 4.5   5   TRUE
## 
## $`4`
##    pos val thresh
## 17 4.8  -1  FALSE
## 
## $`5`
##    pos val thresh
## 18 5.1  -2  FALSE
## 
## $`6`
##    pos val thresh
## 19 5.4   0  FALSE
## 20 5.7   5   TRUE
PeterQ asked Mar 13 '23 06:03


1 Answer

Here is one option with data.table. We create a grouping variable with rleid, which numbers each consecutive run of identical thresh values, subset pos by thresh within each group, and then split the result by group.

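library(data.table)

# setDT() converts DF to a data.table by reference.
# pos[thresh] is empty for FALSE runs, so those groups drop out of the result.
DT <- setDT(DF)[, pos[thresh], by = .(gr = rleid(thresh))]
split(DT$V1, DT$gr)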
#$`2`
#[1] 0.3 0.6 0.9 1.2

#$`4`
#[1] 1.8 2.1 2.4 2.7 3.0 3.3 3.6 3.9

#$`6`
#[1] 4.5

#$`8`
#[1] 5.7

Or we can use rle from base R to create the grouping variable and then split based on that:

gr <- inverse.rle(within.list(rle(DF$thresh), values <- seq_along(values)))
with(DF, split(pos[thresh], gr[thresh]))
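
To make the run-numbering step above easier to follow, here is a minimal base R sketch that unpacks it. The thresh vector is typed in from the table printed in the question rather than regenerated, because sample() can return different values across R versions. The names r and run_id are illustrative only, and the rep() call is assumed to produce the same ids as the inverse.rle/within.list construction.

```r
# Logical vector copied from the 'thresh' column printed in the question
thresh <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, rep(TRUE, 8),
            FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)

# rle() compresses the vector into runs: one length and one value per run
r <- rle(thresh)
r$lengths
r$values

# Number the runs 1, 2, 3, ... and expand back to one id per row
run_id <- rep(seq_along(r$lengths), r$lengths)

# Ids of the TRUE runs only; the FALSE runs hold the ids in between
unique(run_id[thresh])
```

Because FALSE and TRUE runs alternate and the data start with a FALSE run, the TRUE runs get the even ids, which is why the list elements in the data.table output are named 2, 4, 6 and 8.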

Or, as @thelatemail mentioned, cumsum can also be used for grouping after subsetting by thresh:

with(DF, split(pos[thresh], cumsum(!thresh)[thresh]))
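
The question asks for the start and end coordinate of each region rather than every position in it, so one way to finish the job is to reduce each group to its minimum and maximum. The following is a sketch under assumptions, not part of the original answer: pos and thresh are rebuilt from the table printed in the question, and grp, starts, ends and regions are illustrative names.

```r
# Data rebuilt from the table printed in the question
pos <- seq(from = 0, by = 0.3, length.out = 20)
thresh <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, rep(TRUE, 8),
            FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)

# cumsum(!thresh) goes up by one at every FALSE row, so all TRUE rows
# between two FALSE rows share the same counter value
grp <- cumsum(!thresh)[thresh]

# Reduce each group of positions to its first and last coordinate
starts <- tapply(pos[thresh], grp, min)
ends   <- tapply(pos[thresh], grp, max)

regions <- data.frame(start = as.vector(starts), end = as.vector(ends))
regions
```

With the data shown in the question, regions should have four rows corresponding to 0.3 - 1.2, 1.8 - 3.9, 4.5 - 4.5 and 5.7 - 5.7. Taking min and max relies on pos being sorted in increasing order within each run, which holds here because pos is built with seq.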
akrun answered Mar 24 '23 04:03