Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Counting column data in a matrix with resets

Tags:

r

I'm gathering data on how much my cats poop into a matrix:

m <- cbind(fluffy=c(1.1,1.2,1.3,1.4),misterCuddles=c(0.9,NA,1.1,1.0))
row.names(m) <- c("2013-01-01", "2013-01-02", "2013-01-03","2013-01-04")

Which gives me this:

           fluffy misterCuddles
2013-01-01    1.1           0.9
2013-01-02    1.2            NA
2013-01-03    1.3           1.1
2013-01-04    1.4           1.0

On every date, I'd like to know how many days in a row each cat has gone number 2. So the resulting matrix should look like this:

           fluffy misterCuddles
2013-01-01      1             1
2013-01-02      2             0
2013-01-03      3             1
2013-01-04      4             2

Is there a way to do this efficiently? The cumsum function does something similar, but that's a primitive so I can't modify it to suit my dirty, dirty needs.

I could run a for loop and store a count like so:

m.output <- matrix(nrow=nrow(m),ncol=ncol(m))
for (column in 1:ncol(m)) {
  sum <- 0
  for (row in 1:nrow(m)) {
    if (is.na(m[row,column])) sum <- 0
    else sum <- sum + 1

    m.output[row,column] <- sum
  }
}

Is this the most efficient way to do this? I have a lot of cats, and I've recorded years worth of poop data. Can I parallellize this by column somehow?

like image 526
dvmlls Avatar asked Dec 10 '13 14:12

dvmlls


2 Answers

All of the answers here are actually too complicated (including my own, from earlier, copied below). The Reduce family of answers is just masking a for-loop in a single function call. I like Roland's and Ananda's, but both I think have a little too much going on.

Thus, here's a simple vectorized solution:

reset <- function(x) {
    s <- seq_along(x)
    s[!is.na(x)] <- 0
    seq_along(x) - cummax(s)
}

> apply(m, 2, reset)
     fluffy misterCuddles
[1,]      1             1
[2,]      2             0
[3,]      3             1
[4,]      4             2

It also works on Roland's example:

m2 <- cbind(fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
           misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA))

> apply(m2, 2, reset)
     fluffy misterCuddles
[1,]      0             0
[2,]      1             1
[3,]      2             2
[4,]      3             0
[5,]      4             0
[6,]      5             1
[7,]      6             0

From earlier: this is not vectorized, but also works:

pooprun <- function(x){
    z <- numeric(length=length(x))
    count <- 0
    for(i in 1:length(x)){
        if(is.na(x[i]))
            count <- 0
        else
            count <- + count + 1
        z[i] <- count
    }
    return(z)
}
apply(m, 2, pooprun)

> apply(m, 2, pooprun)
     fluffy misterCuddles
[1,]      1             1
[2,]      2             0
[3,]      3             1
[4,]      4             2

THE BENCHMARKING

Here I simply wrap everyone's answers in a function call (based on their name).

> library(microbenchmark)
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
         expr     min       lq   median       uq       max neval
     alexis()   1.540   4.6200   5.3890   6.1590   372.185  1000
     hadley()  87.755   92.758   94.298  96.6075  1767.012  1000
     thomas()  92.373  99.6860 102.7655 106.6140   315.223  1000
    matthew() 128.168 136.2505 139.7150 145.4880  5196.344  1000
 thomasloop() 133.556 141.6390 145.1030 150.4920 84131.427  1000
      usobi() 148.182 159.9210 164.7320 174.1620  5010.445  1000
     ananda() 720.507 742.4460 763.6140 801.3335  5858.733  1000

And here are the results for Roland's example data:

> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
         expr     min       lq   median       uq      max neval
     alexis()   2.310   5.3890   6.1590   6.9290   75.438  1000
     hadley()  75.053   78.902   80.058   83.136 1747.767  1000
     thomas()  90.834  97.3770 100.2640 104.3050  358.329  1000
    matthew() 139.715 149.7210 154.3405 161.2680 5084.728  1000
 thomasloop() 144.718 155.4950 159.7280 167.4260 5182.103  1000
      usobi() 177.048 188.5945 194.3680 210.9180 5360.306  1000
     ananda() 705.881 729.9370 753.4150 778.8175 8226.936  1000

Note: Alexis's and Hadley's solutions took quite a while to actually define as functions on my machine, whereas the others work out-of-the-box, but Alexis's is otherwise the clear winner.

like image 88
Thomas Avatar answered Oct 13 '22 15:10

Thomas


This should work. Note that each of your cats is an independent individual so you can turn your data frame into a list and use mclapply which uses a paralleled approach.

count <- function(y,x){
  if(is.na(x)) return(0)
  return (y + 1)
}

oneCat = m[,1]

Reduce(count,oneCat,init=0,accumulate=TRUE)[-1]

EDIT: here is the full answer

count <- function(x,y){
 if(is.na(y)) return(0)
 return (x + 1)
}

mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)

EDIT2: The main bad problem is that I do get extra 0's at the beginning so...

result = mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
finalResult = do.call('cbind',result)[-1,]
rownames(finalResult) = rownames(m)

does the job.

like image 5
Usobi Avatar answered Oct 13 '22 17:10

Usobi