Find start and end positions/indices of runs/consecutive values

Tags: r, vector, run-length-encoding

Problem: Given an atomic vector, find the start and end indices of runs in the vector.

Example vector with runs:

x = rev(rep(6:10, 1:5))
# [1] 10 10 10 10 10  9  9  9  9  8  8  8  7  7  6

Output from rle():

rle(x)
# Run Length Encoding
#  lengths: int [1:5] 5 4 3 2 1
#  values : int [1:5] 10 9 8 7 6

Desired output:

#   start end
# 1     1   5
# 2     6   9
# 3    10  12
# 4    13  14
# 5    15  15

The base rle class doesn't appear to provide this functionality, but the class Rle and function rle2 do. However, given how minor the functionality is, sticking to base R seems more sensible than installing and loading additional packages.

Existing code snippets (including some on SO) solve the slightly different problem of finding start and end indices for runs which satisfy some condition. I wanted something that would be more general, could be performed in one line, and didn't involve the assignment of temporary variables or values.

Answering my own question because I was frustrated by the lack of search results. I hope this helps somebody!

asked May 09 '17 16:05

Clara

2 Answers

A data.table possibility, where .I and .N are used to pick relevant indices, per group defined by rleid runs.

library(data.table)
data.table(x)[ , .(start = .I[1], end = .I[.N]), by = rleid(x)][, rleid := NULL][]
#    start end
# 1:     1   5
# 2:     6   9
# 3:    10  12
# 4:    13  14
# 5:    15  15

answered Oct 18 '22 18:10

Henrik

Core logic:

# Example vector and rle object
x = rev(rep(6:10, 1:5))
rle_x = rle(x)

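# Compute endpoints of each run
end = cumsum(rle_x$lengths)
# Each run starts one past the previous run's end; head(end, -1) drops the last end.
# (Base R's stats::lag() only shifts time-series attributes, not values, so it is avoided here.)
start = c(1, head(end, -1) + 1)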

# Display results
data.frame(start, end)
#   start end
# 1     1   5
# 2     6   9
# 3    10  12
# 4    13  14
# 5    15  15
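
The same result can also be written as a single expression with no temporary variables, which is what the question asked for. The following is a minimal sketch in base R; the helper name run_bounds is hypothetical and not part of any package. It relies on the fact that each run starts at its end position minus its length plus one.

```r
# Hypothetical helper (not from any package): start/end indices of runs in x
run_bounds = function(x) {
  # Inside with(), `lengths` refers to the lengths component of the rle object
  with(rle(x), data.frame(start = cumsum(lengths) - lengths + 1L,
                          end   = cumsum(lengths)))
}

x = rev(rep(6:10, 1:5))
run_bounds(x)
#   start end
# 1     1   5
# 2     6   9
# 3    10  12
# 4    13  14
# 5    15  15
```

Wrapping the expression in a function is optional; the with(rle(x), ...) call on its own is the one-liner.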

Tidyverse/dplyr way (data frame-centric):

library(dplyr)

rle(x) %>%
  unclass() %>%
  as.data.frame() %>%
  mutate(end = cumsum(lengths),
         start = c(1, dplyr::lag(end)[-1] + 1)) %>%
  magrittr::extract(c(1,2,4,3)) # To re-order start before end for display

Because the start and end vectors are the same length as the values component of the rle object, solving the related problem of identifying endpoints for runs meeting some condition is straightforward: filter or subset the start and end vectors using the condition on the run values.
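
As a rough illustration of that filtering step, the sketch below keeps the run values alongside the endpoints and subsets on them. The data frame name runs and the two conditions (values of at least 9, runs of length at least 3) are made up for the example.

```r
# Runs in x and their endpoints (base R only)
x = rev(rep(6:10, 1:5))
rle_x = rle(x)
end = cumsum(rle_x$lengths)
start = end - rle_x$lengths + 1L
runs = data.frame(value = rle_x$values, start, end)

# Illustrative condition on the run values: keep runs of values >= 9
runs[runs$value >= 9, ]
#   value start end
# 1    10     1   5
# 2     9     6   9

# Illustrative condition on the run lengths: keep runs of length at least 3
long_runs = runs[rle_x$lengths >= 3, ]
```

The same subsetting works with dplyr::filter() if the run values are kept as a column, as in the pipeline above.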

answered Oct 18 '22 18:10

Clara
