
Efficiently change elements in data based on neighbouring elements

Tags: r

Let me dive right in. Imagine you have data that looks like this:

df <- data.frame(one   = c(1,  1, NA, 13),
                 two   = c(2, NA, 10, 14),
                 three = c(NA, NA, 11, NA),
                 four  = c(4,  9, 12, NA))

This gives us:

df
#   one two three four
# 1   1   2    NA    4
# 2   1  NA    NA    9
# 3  NA  10    11   12
# 4  13  14    NA   NA

Each row contains measurements for weeks 1, 2, 3 and 4, respectively. Suppose the numbers represent some accumulated measure since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents the cumulative value of weeks 3 and 4.

Now I want to "even out" these numbers (feel free to correct my terminology here) by spreading each measurement evenly across all preceding weeks in which no measurement took place. For instance, row 1 should read

 1 2 2 2 

since the 4 in the original data represents the cumulative value of 2 weeks (weeks "three" and "four"), and 4/2 is 2. Trailing NAs with no later measurement, as in row 4, should stay NA.

The final result should look like this:

df
#  one two three four
# 1   1   2    2    2
# 2   1   3    3    3
# 3   5   5   11   12
# 4  13  14   NA   NA

I am struggling a bit with how best to approach this. One candidate solution would be to get the indices of all missing values, count the lengths of runs (NAs occurring multiple times), and use that to fill in the values somehow. However, my real data is large, and I suspect such a strategy might be time-consuming. Is there an easier and more efficient way?
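For reference, the run-length idea for a single row could start from rle(), though I suspect looping over the runs won't scale (just a sketch of the idea, not a solution):

x <- c(NA, 10, 11, 12)
rle(is.na(x))
# Run Length Encoding
#   lengths: int [1:2] 1 3
#   values : logi [1:2] TRUE FALSE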

asked by coffeinjunky


4 Answers

A base R solution would be to first identify the indices that need to be replaced, then determine groupings of those indices, and finally assign the grouped values with the ave function:

clean <- function(x) {
  # positions to replace: each position that is NA or immediately follows an NA
  to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
  # group each run of NAs together with the non-NA value that closes it
  groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
  # within each group, spread the closing value evenly over the group
  x[to.rep] <- ave(x[to.rep], groups, FUN=function(y) {
    rep(tail(y, 1) / length(y), length(y))
  })
  return(x)
}
t(apply(df, 1, clean))
#      one two three four
# [1,]   1   2     2    2
# [2,]   1   3     3    3
# [3,]   5   5    11   12
# [4,]  13  14    NA   NA
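To see what the pieces compute, here is a quick trace on row 1, with the intermediate results shown as comments:

x <- c(1, 2, NA, 4)
which(is.na(x) | c(FALSE, head(is.na(x), -1)))
# [1] 3 4
cumsum(c(TRUE, head(!is.na(x[c(3, 4)]), -1)))
# [1] 1 1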

If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:

library(Rcpp)
cppFunction(
"NumericVector cleanRcpp(NumericVector x) {
  const int n = x.size();
  NumericVector y = clone(x);  // deep copy so the input is not modified in place
  int consecNA = 0;            // length of the current run of NAs
  for (int i=0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      ++consecNA;
    } else if (consecNA > 0) {
      // spread x[i] evenly over the NA run plus the current position
      const double replacement = x[i] / (consecNA + 1);
      for (int j=i-consecNA; j <= i; ++j) {
        y[j] = replacement;
      }
      consecNA = 0;
    }
  }
  return y;
}")
t(apply(df, 1, cleanRcpp))
#      one two three four
# [1,]   1   2     2    2
# [2,]   1   3     3    3
# [3,]   5   5    11   12
# [4,]  13  14    NA   NA
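The compiled function can also be called directly on a single vector (the clone above keeps the input unchanged):

cleanRcpp(c(1, NA, NA, 9))
# [1] 1 3 3 3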

We can compare performance on a larger instance (10000 x 100 matrix):

set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
#    user  system elapsed 
#   4.918   0.035   4.992 
system.time(apply(mat, 1, cleanRcpp))
#    user  system elapsed 
#   0.093   0.016   0.120 

In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.

answered by josliber


Here's a base R solution that's nearly as fast as josilber's Rcpp function:

spread_left <- function(df) {
    nc <- ncol(df)
    # append a -Inf sentinel column, flatten row-wise, and reverse, so every
    # row becomes a right-to-left run in one long vector
    x <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
    ii <- cumsum(!is.na(x))    # group index: each value plus the NAs it covers
    f <- tabulate(ii)          # size of each group
    v <- x[!duplicated(ii)]    # the value heading each group
    xx <- v[ii]/f[ii]          # spread each value evenly over its group
    xx[xx == -Inf] <- NA       # sentinel groups (incl. trailing NAs) back to NA
    # undo the reversal and reshape, dropping the sentinel column
    m <- matrix(rev(xx), ncol=nc+1, byrow=TRUE)[,seq_len(nc)]
    as.data.frame(m)
}
spread_left(df)
#   one two three four
# 1   1   2     2    2
# 2   1   3     3    3
# 3   5   5    11   12
# 4  13  14    NA   NA

It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
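For instance, stepping through on the small df, the reversed working vector begins with the sentinel and the trailing NAs of row 4, and the group index lumps each run of NAs with the value that heads it:

x  <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
ii <- cumsum(!is.na(x))
head(x, 5)
# [1] -Inf   NA   NA   14   13
head(ii, 5)
# [1] 1 1 1 2 3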

Here are benchmarks for all currently posted solutions:

library(rbenchmark)
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)

## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean)))) 
# [1] TRUE

## Then compare its speed (fn is Henrik's function, defined in his answer below)
benchmark(josilberR     = t(apply(mat, 1, clean)),
          josilberRcpp  = t(apply(mat, 1, cleanRcpp)),
          Josh          = spread_left(df),
          Henrik        = t(apply(df, 1, fn)),
          replications = 10)
#           test replications elapsed relative user.self sys.self
# 4       Henrik           10   38.81   25.201     38.74     0.08
# 3         Josh           10    2.07    1.344      1.67     0.41
# 1    josilberR           10   57.42   37.286     57.37     0.05
# 2 josilberRcpp           10    1.54    1.000      1.44     0.11
answered by Josh O'Brien


Another base R possibility. I first create a grouping variable (grp), over which the 'spread' is then performed with ave:

fn <- function(x){
  # group each value with the NAs preceding it; trailing NAs get group 0
  grp <- rev(cumsum(!is.na(rev(x))))
  # spread each group's (single) value evenly over the group
  res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
  res[grp == 0] <- NA   # trailing NAs have no later value to spread
  res
}

t(apply(df, 1, fn))
#      one two three four
# [1,]   1   2     2    2
# [2,]   1   3     3    3
# [3,]   5   5    11   12
# [4,]  13  14    NA   NA
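To see the grouping, here is grp for rows 1 and 4 of df (trailing NAs land in group 0):

rev(cumsum(!is.na(rev(c(1, 2, NA, 4)))))
# [1] 3 2 1 1
rev(cumsum(!is.na(rev(c(13, 14, NA, NA)))))
# [1] 2 1 0 0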
answered by Henrik


I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing @Henrik's function almost verbatim and converting to long format:

require(data.table) # 1.9.5
fill_naseq <- function(df){

    # switch to long format
    # id is scaled by ncol(df) so the group labels g computed below
    # stay distinct across rows
    DT  <- data.table(id=(1:nrow(df))*ncol(df),df)
    mDT <- setkey(melt(DT,id.vars="id"),id)
    mDT[,value := as.numeric(value)]

    mDT[,badv  := is.na(value)]     
    mDT[
      # subset to rows that need modification
      badv|shift(badv),
      # apply @Henrik's function, more or less
      value:={
        g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
        ave(value,g,FUN=function(x){n = length(x); x[n]/n})
    }]

    # revert to wide format
    (setDF(dcast(mDT,id~variable)[,id:=NULL]))
}

identical(fill_naseq(df),spread_left(df)) # TRUE

To show the best-case scenario for this approach, I simulated data in which NAs are very infrequent:

nr = 1e4
nc = 100
nafreq = 1/1e4

mat <- matrix(sample(
  c(NA,1:3),
  nr*nc, 
  replace=TRUE,
  prob=c(nafreq,rep((1-nafreq)/3,3))
),nrow=nr)
df  <- as.data.frame(mat)

benchmark(F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
#   test replications elapsed relative user.self
# 1    F           10    3.82    1.394      3.72
# 2 Josh           10    2.74    1.000      2.70
# I don't have Rcpp installed and so left off josilber's even faster approach

So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:

DT  <- data.table(id=(1:nrow(df))*ncol(df),df)
mDT <- setkey(melt(DT,id.vars="id"),id)
mDT[,value := as.numeric(value)]

fill_naseq_long <- function(mDT){
    mDT[,badv := is.na(value)]
    mDT[badv|shift(badv),value:={
      g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
      ave(value,g,FUN=function(x){n = length(x); x[n]/n})
    }]
    mDT
}

benchmark(
  F2=fill_naseq_long(mDT),F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
#   test replications elapsed relative user.self
# 2    F           10    3.98    8.468      3.81
# 1   F2           10    0.47    1.000      0.45
# 3 Josh           10    2.72    5.787      2.69

Now it comes out well ahead. And who doesn't like keeping their data in long format? This also has the advantage of working even when the number of observations per "id" varies.

answered by Frank