I hope I am able to explain clearly what I would like to do.
I have a matrix
Z<-matrix(sample(1:40),ncol=4)
colnames(Z)<-c("value","A","B","C")
I would like to apply the following formula to each row in the dataset.
Process = (value - row mean of (A, B, C)) / (row-wise standard deviation of (A, B, C))
I thought of calculating everything separately, like this.
Subsetting the data first:
onlyABC <- Z[, 2:4]  # columns A, B, C (column 1 is "value")
Then apply the rowMeans to each row
means <- rowMeans(onlyABC)  # rowMeans already works row by row, so no apply is needed
And similarly compute standard deviation separately using
deviate <- apply(onlyABC, 1, sd)  # the base R function is sd, in lowercase
And then I do not know how to subtract 'means' from the value column of matrix 'Z' and then divide by 'deviate'.
Is there a simpler approach to do this?
As an example, applying the formula to the first row will give:
row1 = (32 - (19 + 35 + 4)/3) / SD(19, 35, 4)
Similarly apply the formula to other rows as well and get a vector of size 10 in the end.
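The first-row arithmetic can be checked directly in R. This is only a sketch using the example numbers quoted in the question (32 for value; 19, 35 and 4 for A, B and C); the variable names are made up for illustration.

```r
# Example numbers from the question's first row
value <- 32
abc   <- c(19, 35, 4)

# (value - mean of A, B, C) / sample standard deviation of A, B, C
row1 <- (value - mean(abc)) / sd(abc)
row1  # roughly 0.817
```

Note that sd() uses the n - 1 denominator, so this is the sample standard deviation.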
ksd<-apply(Z[,-1],1,sd)
kmean<-rowMeans(Z[,-1])
Z[,1]<-(Z[,1]-kmean)/ksd
> Z
value A B C
[1,] 0.88181533 26 4 31
[2,] -0.04364358 17 22 7
[3,] 2.21200505 25 13 18
[4,] 0.50951017 8 34 40
[5,] 0.03866223 12 6 23
[6,] -0.64018440 29 16 30
[7,] -0.40927275 39 35 9
[8,] -0.65103077 24 5 1
[9,] 0.89658092 37 27 3
[10,] 0.26360896 11 10 28
This isn't quite an apply problem, as you want to exclude the first column of each row from the calculation.
The iterative way of doing this is to first create the output vector, and then substitute into it as follows:
tranZ <- numeric(nrow(Z))
for (i in seq_len(nrow(Z))) {
  # (value - mean of A, B, C) / sd of A, B, C for row i
  tranZ[i] <- (Z[i, 1] - mean(Z[i, -1])) / sd(Z[i, -1])
}
If you have a large dataset, I suggest using the power of vectorisation. Try the following:
(Z[,1] - rowMeans(Z[,-1])) / apply(Z[, -1], 1, sd)
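To see that the loop and the vectorised expression agree, here is a small check. It is a sketch only: the seed is an arbitrary choice to make the example reproducible, and loop_res and vec_res are illustrative names, not part of the original code.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
Z <- matrix(sample(1:40), ncol = 4)
colnames(Z) <- c("value", "A", "B", "C")

# Loop version: one row at a time
loop_res <- numeric(nrow(Z))
for (i in seq_len(nrow(Z))) {
  loop_res[i] <- (Z[i, 1] - mean(Z[i, -1])) / sd(Z[i, -1])
}

# Vectorised version: rowMeans for the means, apply only for the sd
vec_res <- (Z[, 1] - rowMeans(Z[, -1])) / apply(Z[, -1], 1, sd)

# Compare the two results up to floating-point tolerance
all.equal(loop_res, unname(vec_res))
```

Both versions should produce the same vector of length 10, since they compute the same quantity for each row.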
Or with vapply:
tranZ_v <- vapply(1:nrow(Z), function(X) (Z[X, 1] - mean(Z[X, -1])) / sd(Z[X, -1]),
FUN.VALUE = numeric(1))
The key to using the *apply family in this case is controlling what is iterated over. To do this, I've iterated across the row indices 1:nrow(Z) rather than the object itself, and indexed into the object inside the function.
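The index-based pattern can be wrapped in a small function. This is a minimal sketch: row_process is a hypothetical name, and it uses seq_len(nrow(m)) in place of 1:nrow(m) so that a matrix with zero rows yields an empty result instead of iterating over c(1, 0).

```r
# Hypothetical helper (name not from the original post):
# iterate over row indices and index into the matrix inside the function.
row_process <- function(m) {
  vapply(seq_len(nrow(m)),
         function(i) (m[i, 1] - mean(m[i, -1])) / sd(m[i, -1]),
         FUN.VALUE = numeric(1))
}

# Two made-up rows: column 1 is the value, columns 2 to 4 are A, B, C
m <- matrix(c(32, 19, 35, 4,
              10,  1,  2, 3), ncol = 4, byrow = TRUE)

res   <- row_process(m)
empty <- row_process(m[0, , drop = FALSE])  # zero-row input gives numeric(0)
```

Because FUN.VALUE = numeric(1) is declared, vapply raises an error if the function ever returns something other than a single number, which sapply would not do.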
Benchmarking
library(rbenchmark)
process <- function(x) {
(x[["value"]] - mean(c(x[["A"]], x[["B"]], x[["C"]]))) / sd(c(x[["A"]], x[["B"]], x[["C"]]))
}
p2 <- function(x) {
(x[1] - mean(x[-1])) / sd(x[-1])
}
apply_fun <- function() apply(Z, 1, process)
apply_fun2 <- function() apply(Z, 1, p2)
apply_sd <- function() (Z[,1] - rowMeans(Z[,-1])) / apply(Z[, -1], 1, sd)
vapply_anon <- function() vapply(1:nrow(Z), FUN = function(X) (Z[X, 1] - mean(Z[X, -1])) / sd(Z[X, -1]),
FUN.VALUE = numeric(1))
bb <- benchmark(apply_fun(), apply_fun2(), apply_sd(), vapply_anon(),
columns = c('test', 'elapsed', 'relative'),
replications = 100,
order = 'elapsed')
The vectorised approach, using apply only for the sd, is fastest:
> bb
test elapsed relative
3 apply_sd() 0.021 1.000
4 vapply_anon() 0.030 1.429
1 apply_fun() 0.033 1.571
2 apply_fun2() 0.034 1.619
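The remaining apply call could in principle be removed as well, by computing the row-wise standard deviation from rowMeans and rowSums. This is a different technique from the ones benchmarked above and has not been timed here; it is a sketch, with an arbitrary seed and illustrative variable names.

```r
set.seed(42)  # arbitrary seed for reproducibility
Z <- matrix(sample(1:40), ncol = 4)
colnames(Z) <- c("value", "A", "B", "C")

X  <- Z[, -1]      # columns A, B, C
mu <- rowMeans(X)  # one mean per row

# Sample sd per row: sqrt(sum((x - mean)^2) / (n - 1)), the same definition sd() uses.
# X - mu recycles mu down the columns, so mu[i] is subtracted from every entry of row i.
row_sd <- sqrt(rowSums((X - mu)^2) / (ncol(X) - 1))

res_vec   <- (Z[, 1] - mu) / row_sd
res_apply <- (Z[, 1] - rowMeans(Z[, -1])) / apply(Z[, -1], 1, sd)

all.equal(res_vec, res_apply)
```

Whether this is actually faster depends on the size of the matrix, so it is worth adding it to the benchmark before relying on it.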