Applying a function row-wise for a dataset

Tags:

r

I hope I can explain clearly what I would like to do.

I have a matrix

    Z <- matrix(sample(1:40), ncol = 4)
    colnames(Z) <- c("value", "A", "B", "C")

I would like to apply the following formula to each row in the dataset.


              value - row mean(A, B, C)
    Process = --------------------------------------
              row-wise standard deviation(A, B, C)

I thought of calculating each piece separately, like this.

Subsetting the data first (columns A, B and C only):

    onlyABC <- Z[, c("A", "B", "C")]

Then computing the mean of each row:

    means <- rowMeans(onlyABC)

And similarly computing the standard deviation of each row:

    deviate <- apply(onlyABC, 1, sd)

After that, I do not know how to subtract 'means' from the value column of matrix 'Z' and then divide by 'deviate'.

Is there a simpler approach to do this?

As an example, applying the formula to the first row would give:

    row1 = 32 - (19 + 35 + 4)/3
           --------------------
           SD(19, 35, 4)

Similarly, apply the formula to the other rows to get a vector of length 10 in the end.
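
The arithmetic for the first row can be checked directly in R. This is a minimal sketch that hard-codes the example numbers from the question (32, 19, 35, 4); the variable names are illustrative only. Note that sd() computes the sample standard deviation (n - 1 denominator).

```r
# Example numbers taken from the first row in the question
value <- 32
abc <- c(A = 19, B = 35, C = 4)

# (value - mean of A, B, C) / sample standard deviation of A, B, C
row1 <- (value - mean(abc)) / sd(abc)
row1
```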

Paul asked Oct 15 '13 18:10


2 Answers

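# Row-wise standard deviation of A, B, C (every column except the first)
ksd <- apply(Z[, -1], 1, sd)
# Row-wise mean of the same columns
kmean <- rowMeans(Z[, -1])
# Overwrite the value column with the standardised result
Z[, 1] <- (Z[, 1] - kmean) / ksd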
> Z
            value  A  B  C
 [1,]  0.88181533 26  4 31
 [2,] -0.04364358 17 22  7
 [3,]  2.21200505 25 13 18
 [4,]  0.50951017  8 34 40
 [5,]  0.03866223 12  6 23
 [6,] -0.64018440 29 16 30
 [7,] -0.40927275 39 35  9
 [8,] -0.65103077 24  5  1
 [9,]  0.89658092 37 27  3
[10,]  0.26360896 11 10 28
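
Because the matrix is built with sample(), the numbers above will differ from run to run. The following is a sketch of the same calculation with a fixed seed (the seed value 1 is an arbitrary assumption) that stores the result in a separate vector, here called z_score for illustration, instead of overwriting the value column:

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
Z <- matrix(sample(1:40), ncol = 4)
colnames(Z) <- c("value", "A", "B", "C")

abc <- Z[, c("A", "B", "C")]              # select by name rather than position
ksd <- apply(abc, 1, sd)                  # row-wise sample standard deviation
kmean <- rowMeans(abc)                    # row-wise mean
z_score <- (Z[, "value"] - kmean) / ksd   # one result per row; Z is left unchanged
z_score
```

Selecting the columns by name keeps the code correct even if the column order changes.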
Metrics answered Oct 24 '22 19:10


This isn't quite an apply problem, as you want to exclude the first column of each row from the calculation.

The iterative way of doing this is to first create the output vector, and then substitute into it as follows:

tranZ <- numeric(nrow(Z))
for (i in seq_len(nrow(Z))) {
    tranZ[i] <- (Z[i, 1] - mean(Z[i, -1])) / sd(Z[i, -1])
}

If you have a large dataset, I suggest using vectorisation. Try the following:

(Z[,1] - rowMeans(Z[,-1])) / apply(Z[, -1], 1, sd)

Or with vapply:

tranZ_v <- vapply(1:nrow(Z), function(X) (Z[X, 1] - mean(Z[X, -1])) / sd(Z[X, -1]),
                FUN.VALUE = numeric(1))

The key to using the *apply family in this case is controlling what gets iterated over. To do this, I've iterated across 1:nrow(Z) rather than the object itself, and indexed into the object inside the function.
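
A quick way to confirm that the loop, the vectorised expression and the vapply version agree is to compare them with all.equal(). This sketch assumes a reproducible Z (the seed is arbitrary) and reuses the names from above; tranZ_vec is a name introduced here for the vectorised result.

```r
set.seed(42)  # arbitrary seed for reproducibility
Z <- matrix(sample(1:40), ncol = 4)
colnames(Z) <- c("value", "A", "B", "C")

# Loop version
tranZ <- numeric(nrow(Z))
for (i in seq_len(nrow(Z))) {
    tranZ[i] <- (Z[i, 1] - mean(Z[i, -1])) / sd(Z[i, -1])
}

# Vectorised version (apply is used only for the row-wise sd)
tranZ_vec <- (Z[, 1] - rowMeans(Z[, -1])) / apply(Z[, -1], 1, sd)

# vapply version, iterating over row indices
tranZ_v <- vapply(seq_len(nrow(Z)),
                  function(i) (Z[i, 1] - mean(Z[i, -1])) / sd(Z[i, -1]),
                  FUN.VALUE = numeric(1))

# Both comparisons should be TRUE
isTRUE(all.equal(tranZ, tranZ_vec))
isTRUE(all.equal(tranZ, tranZ_v))
```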


Benchmarking

library(rbenchmark)  # stops with an error if the package is not installed

process <- function(x) {
    (x[["value"]] - mean(c(x[["A"]], x[["B"]], x[["C"]]))) / sd(c(x[["A"]], x[["B"]], x[["C"]]))
}          

p2 <- function(x) {
    (x[1] - mean(x[-1])) / sd(x[-1])
}

apply_fun <- function() apply(Z, 1, process)
apply_fun2 <- function() apply(Z, 1, p2)

apply_sd <- function() (Z[,1] - rowMeans(Z[,-1])) / apply(Z[, -1], 1, sd)

vapply_anon <- function() vapply(1:nrow(Z), FUN = function(X) (Z[X, 1] - mean(Z[X, -1])) / sd(Z[X, -1]),
                FUN.VALUE = numeric(1))


bb <- benchmark(apply_fun(), apply_fun2(), apply_sd(), vapply_anon(), 
          columns = c('test', 'elapsed', 'relative'), 
          replications = 100, 
          order = 'elapsed')

The vectorised approach, which uses apply only for the sd, is fastest:

> bb
           test elapsed relative
3    apply_sd()   0.021    1.000
4 vapply_anon()   0.030    1.429
1   apply_fun()   0.033    1.571
2  apply_fun2()   0.034    1.619
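
The remaining apply call can also be replaced by matrix arithmetic. The following is a sketch only: it is not part of the benchmark above and has not been timed here, and row_sd is a hypothetical helper name. It relies on R's column-major recycling, so subtracting a vector of length nrow(X) from X removes each row's mean from that row.

```r
# Hypothetical helper: row-wise sample standard deviation without apply()
row_sd <- function(X) {
    centred <- X - rowMeans(X)                  # recycled down columns, so row i loses its own mean
    sqrt(rowSums(centred^2) / (ncol(X) - 1))    # n - 1 denominator, matching sd()
}

set.seed(1)  # arbitrary seed for reproducibility
Z <- matrix(sample(1:40), ncol = 4)
colnames(Z) <- c("value", "A", "B", "C")

abc <- Z[, -1]
no_apply <- (Z[, 1] - rowMeans(abc)) / row_sd(abc)

# Should agree with the apply-based version
isTRUE(all.equal(no_apply, (Z[, 1] - rowMeans(abc)) / apply(abc, 1, sd)))
```

Whether this is faster than apply_sd() on a given dataset would need to be measured.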
ricardo answered Oct 24 '22 19:10