Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bucketing in R or SQL

I am completely stumped on a problem and would like some guidance. I am picking random sets of 8 numbers from the set of 1 to 8 (for example, 5,6,8,1,3,4,2,7) and trying to bucket those numbers as subsets of sequential numbers according to the order they appear.

For the example above, the first bucket would start with a 5 then the 6 would be added. Upon hitting the 8 a new bucket would be started. Whenever we get to a number that belongs in an existing bucket (e.g., when we reach 2, it can be added to 1's bucket), we add it there. In this example, after all 8 numbers we'd arrive at:

5,6,7
8
1,2
3,4

For a total of 4 buckets.

I am not actually concerned with the contents of the buckets, I just want to count how many buckets there are for a given random set of 8 digits. I plan on looping through a set of 1000 of these 8 digit sequences.

like image 562
Turtle Time Avatar asked Jul 16 '15 18:07

Turtle Time


2 Answers

My solution, not ripped of from nongkrong but quite similar. You get the count of buckets:

x <- as.integer(c(5,6,8,1,3,4,2,7))
sum(is.na(sapply(1:length(x), function(i) which((x[i]-1L)==x[1:i])[1L])))
# [1] 4

I believe it is possible to vectorize it, then it would scale perfectly.

like image 85
jangorecki Avatar answered Oct 03 '22 08:10

jangorecki


If you are just interested in the number of buckets,

## Your data
dat <- c( 5,6,8,1,3,4,2,7)

## Get the number of buckets
count <- 0
for (i in seq_along(dat))
    if (!((dat[i] - 1) %in% dat[1:i])) count <- count+1
count
# 4

and, more succinctly in a function

countBuckets <- function(lst) sum(sapply(1:length(lst), function(i)
    (!((lst[i]-1) %in% lst[1:i]))))

And, here is a recursive implementation to get the contents of buckets.

f <- function(lst, acc=NULL) {
    if (length(lst) == 0) return(acc)
    if (missing(acc)) return( Recall(lst[-1], list(lst[1])) )

    diffs <- sapply(acc, function(x) lst[1] - x[length(x)] == 1)
    if (any(diffs)) {
        acc[[which(diffs)]] <- c(acc[[which(diffs)]], lst[1])
    } else { acc <- c(acc, lst[1]) }
    return ( Recall(lst[-1], acc) )
}

f(dat)

# [[1]]
# [1] 5 6 7
# 
# [[2]]
# [1] 8
# 
# [[3]]
# [1] 1 2
# 
# [[4]]
# [1] 3 4
like image 35
Rorschach Avatar answered Oct 03 '22 08:10

Rorschach