Create grouping variable for consecutive sequences and split vector

Question

I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30) and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence, i.e. an increase by 1, in a ragged vector resulting in:

L1: 1
L2: 3,4,5
L3: 9,10
L4: 17
L5: 29,30

Naive code (of an ex-C programmer):

partition.neighbors <- function(v)
{
    result <<- list() #jagged array
    currentList <<- v[1] #current series

    for(i in 2:length(v))
    {
        if(v[i] - v [i-1] == 1)
        {
            currentList <<- c(currentList, v[i])
        }
        else
        {
            result <<- c(result, list(currentList))
            currentList <<- v[i] #next series
        }       
    }

    return(result)  
}

Now I understand that

a) R is not C (despite the curly brackets)
b) global variables are pure evil
c) that is a horribly inefficient way of achieving the result

, so any better solutions are welcome.

Joshua Ulrich · Accepted Answer

Making heavy use of some R idioms:

> split(v, cumsum(c(1, diff(v) != 1)))
$`1`
[1] 1

$`2`
[1] 3 4 5

$`3`
[1]  9 10

$`4`
[1] 17

$`5`
[1] 29 30

Aaron left Stack Overflow · Answer

daroczig writes "you could write a lot neater code based on diff"...

Here's one way:

split(v, cumsum(diff(c(-Inf, v)) != 1))

EDIT (added timings):

Tommy discovered this could be faster by being careful with types; the reason it got faster is that split is faster on integers, and is actually faster still on factors.

Here's Joshua's solution; the result from the cumsum is a numeric because it's being c'd with 1, so it's the slowest.

system.time({
a <- cumsum(c(1, diff(v) != 1))
split(v, a)
})
#   user  system elapsed 
#  1.839   0.004   1.848

Just cing with 1L so the result is an integer speeds it up considerably.

system.time({
a <- cumsum(c(1L, diff(v) != 1))
split(v, a)
})
#   user  system elapsed 
#  0.744   0.000   0.746

This is Tommy's solution, for reference; it's also splitting on an integer.

> system.time({
a <- cumsum(c(TRUE, diff(v) != 1L))
split(v, a)
})
#   user  system elapsed 
#  0.742   0.000   0.746

Here's my original solution; it also is splitting on an integer.

system.time({
a <- cumsum(diff(c(-Inf, v)) != 1)
split(v, a)
})
#   user  system elapsed 
#  0.750   0.000   0.754

Here's Joshua's, with the result converted to an integer before the split.

system.time({
a <- cumsum(c(1, diff(v) != 1))
a <- as.integer(a)
split(v, a)
})
#   user  system elapsed 
#  0.736   0.002   0.740

All the versions that split on an integer vector are about the same; it could be even faster if that integer vector was already a factor, as the conversion from integer to factor actually takes about half the time. Here I make it into a factor directly; this is not recommended in general because it depends on the structure of the factor class. It'ss done here for comparison purposes only.

system.time({
a <- cumsum(c(1L, diff(v) != 1))
a <- structure(a, class = "factor", levels = 1L:a[length(a)])
split(v,a)
})
#   user  system elapsed 
#  0.356   0.000   0.357

Tommy · Answer

Joshua and Aaron were spot on. However, their code can still be made more than twice as fast by careful use of the correct types, integers and logicals:

split(v, cumsum(c(TRUE, diff(v) != 1L)))

v <- rep(c(1:5, 19), len = 1e6) # Huge vector...
system.time( split(v, cumsum(c(1, diff(v) != 1))) ) # Joshua's code
# user  system elapsed 
#   2.64    0.00    2.64 

system.time( split(v, cumsum(c(TRUE, diff(v) != 1L))) ) # Modified code
# user  system elapsed 
# 1.09    0.00    1.12

daroczig · Answer

You could define the cut-points easily:

which(diff(v) != 1)

Based on that try:

v <- c(1,3,4,5,9,10,17,29,30)
cutpoints <- c(0, which(diff(v) != 1), length(v))
ragged.vector <- vector("list", length(cutpoints)-1)
for (i in 2:length(cutpoints)) ragged.vector[[i-1]] <- v[(cutpoints[i-1]+1):cutpoints[i]]

Which results in:

> ragged.vector
[[1]]
[1] 1

[[2]]
[1] 3 4 5

[[3]]
[1]  9 10

[[4]]
[1] 17

[[5]]
[1] 29 30

This algorithm is not a nice one but you could write a lot neater code based on diff :) Good luck!

James · Answer

You can create a data.frame and assign the elements to groups using diff, ifelse and cumsum, then aggregate using tapply:

v.df <- data.frame(v = v)
v.df$group <- cumsum(ifelse(c(1, diff(v) - 1), 1, 0))
tapply(v.df$v, v.df$group, function(x) x)

$`1`
[1] 1

$`2`
[1] 3 4 5

$`3`
[1]  9 10

$`4`
[1] 17

$`5`
[1] 29 30

Create grouping variable for consecutive sequences and split vector

Tags:

r

vector

sequence

data-partitioning

letsrock

5 Answers

Joshua Ulrich

Aaron left Stack Overflow

Tommy

daroczig

James

Recent Activity

Donate For Us

Create grouping variable for consecutive sequences and split vector

Tags:

r

vector

sequence

data-partitioning

letsrock

5 Answers

Joshua Ulrich

Aaron left Stack Overflow

Tommy

daroczig

James

Related questions

Recent Activity

Donate For Us