Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create grouping variable for consecutive sequences and split vector

I have a vector, such as c(1, 3, 4, 5, 9, 10, 17, 29, 30) and I would like to group together the 'neighboring' elements that form a regular, consecutive sequence, i.e. an increase by 1, in a ragged vector resulting in:

L1: 1
L2: 3,4,5
L3: 9,10
L4: 17
L5: 29,30

Naive code (of an ex-C programmer):

partition.neighbors <- function(v)
{
    result <<- list() #jagged array
    currentList <<- v[1] #current series

    for(i in 2:length(v))
    {
        if(v[i] - v [i-1] == 1)
        {
            currentList <<- c(currentList, v[i])
        }
        else
        {
            result <<- c(result, list(currentList))
            currentList <<- v[i] #next series
        }       
    }

    return(result)  
}

Now I understand that

a) R is not C (despite the curly brackets)
b) global variables are pure evil
c) that is a horribly inefficient way of achieving the result

, so any better solutions are welcome.

like image 845
letsrock Avatar asked Mar 07 '11 16:03

letsrock


5 Answers

Making heavy use of some R idioms:

> split(v, cumsum(c(1, diff(v) != 1)))
$`1`
[1] 1

$`2`
[1] 3 4 5

$`3`
[1]  9 10

$`4`
[1] 17

$`5`
[1] 29 30
like image 57
Joshua Ulrich Avatar answered Nov 03 '22 01:11

Joshua Ulrich


daroczig writes "you could write a lot neater code based on diff"...

Here's one way:

split(v, cumsum(diff(c(-Inf, v)) != 1))

EDIT (added timings):

Tommy discovered this could be faster by being careful with types; the reason it got faster is that split is faster on integers, and is actually faster still on factors.

Here's Joshua's solution; the result from the cumsum is a numeric because it's being c'd with 1, so it's the slowest.

system.time({
a <- cumsum(c(1, diff(v) != 1))
split(v, a)
})
#   user  system elapsed 
#  1.839   0.004   1.848 

Just cing with 1L so the result is an integer speeds it up considerably.

system.time({
a <- cumsum(c(1L, diff(v) != 1))
split(v, a)
})
#   user  system elapsed 
#  0.744   0.000   0.746 

This is Tommy's solution, for reference; it's also splitting on an integer.

> system.time({
a <- cumsum(c(TRUE, diff(v) != 1L))
split(v, a)
})
#   user  system elapsed 
#  0.742   0.000   0.746 

Here's my original solution; it also is splitting on an integer.

system.time({
a <- cumsum(diff(c(-Inf, v)) != 1)
split(v, a)
})
#   user  system elapsed 
#  0.750   0.000   0.754 

Here's Joshua's, with the result converted to an integer before the split.

system.time({
a <- cumsum(c(1, diff(v) != 1))
a <- as.integer(a)
split(v, a)
})
#   user  system elapsed 
#  0.736   0.002   0.740 

All the versions that split on an integer vector are about the same; it could be even faster if that integer vector was already a factor, as the conversion from integer to factor actually takes about half the time. Here I make it into a factor directly; this is not recommended in general because it depends on the structure of the factor class. It'ss done here for comparison purposes only.

system.time({
a <- cumsum(c(1L, diff(v) != 1))
a <- structure(a, class = "factor", levels = 1L:a[length(a)])
split(v,a)
})
#   user  system elapsed 
#  0.356   0.000   0.357 
like image 12
Aaron left Stack Overflow Avatar answered Nov 03 '22 02:11

Aaron left Stack Overflow


Joshua and Aaron were spot on. However, their code can still be made more than twice as fast by careful use of the correct types, integers and logicals:

split(v, cumsum(c(TRUE, diff(v) != 1L)))

v <- rep(c(1:5, 19), len = 1e6) # Huge vector...
system.time( split(v, cumsum(c(1, diff(v) != 1))) ) # Joshua's code
# user  system elapsed 
#   2.64    0.00    2.64 

system.time( split(v, cumsum(c(TRUE, diff(v) != 1L))) ) # Modified code
# user  system elapsed 
# 1.09    0.00    1.12 
like image 8
Tommy Avatar answered Nov 03 '22 00:11

Tommy


You could define the cut-points easily:

which(diff(v) != 1)

Based on that try:

v <- c(1,3,4,5,9,10,17,29,30)
cutpoints <- c(0, which(diff(v) != 1), length(v))
ragged.vector <- vector("list", length(cutpoints)-1)
for (i in 2:length(cutpoints)) ragged.vector[[i-1]] <- v[(cutpoints[i-1]+1):cutpoints[i]]

Which results in:

> ragged.vector
[[1]]
[1] 1

[[2]]
[1] 3 4 5

[[3]]
[1]  9 10

[[4]]
[1] 17

[[5]]
[1] 29 30

This algorithm is not a nice one but you could write a lot neater code based on diff :) Good luck!

like image 4
daroczig Avatar answered Nov 03 '22 01:11

daroczig


You can create a data.frame and assign the elements to groups using diff, ifelse and cumsum, then aggregate using tapply:

v.df <- data.frame(v = v)
v.df$group <- cumsum(ifelse(c(1, diff(v) - 1), 1, 0))
tapply(v.df$v, v.df$group, function(x) x)

$`1`
[1] 1

$`2`
[1] 3 4 5

$`3`
[1]  9 10

$`4`
[1] 17

$`5`
[1] 29 30
like image 4
James Avatar answered Nov 03 '22 02:11

James