I am trying to use rowSds() to calculate each row's standard deviation so that I can pick the rows that have high SDs to graph.
My data frame is called xx and looks like this:
head(xx,1)
Job variable 2012-02-23 2012-02-24 2012-02-25 2012-02-27 2012-02-28 2012-02-29 2012-03-01 2012-03-02 2012-03-03 2012-03-05 2012-03-06 2012-03-07 2012-03-08 2012-03-09 2012-03-10 2012-03-12 2012-03-13 2012-03-14
1 A Duration 152 424 NA 499 320 117 211 363 NA 605 76 309 204 185 NA 25 733 500
2012-03-15 2012-03-16 2012-03-17 2012-03-19 2012-03-20 2012-03-21 2012-03-22 2012-03-23 2012-03-24 2012-03-26 2012-03-27 2012-03-28 2012-03-29 2012-03-30 2012-03-31 2012-04-02 2012-04-03 2012-04-04 2012-04-05 2012-04-06
1 521 601 NA 229 758 421 334 659 NA 419 423 444 289 594 NA 327 533 183 211 235
2012-04-07 2012-04-09 2012-04-10 2012-04-11 2012-04-12 2012-04-13 2012-04-14 2012-04-16 2012-04-17 2012-04-18 2012-04-19 2012-04-20 2012-04-21 2012-04-23 2012-04-24 2012-04-25 2012-04-26 2012-04-27 2012-04-28 2012-04-30
1 NA 225 419 236 218 188 NA 205 547 153 196 200 NA 259 257 208 302 244 NA 806
2012-05-01 2012-05-02 2012-05-03 2012-05-04 2012-05-05 2012-05-07 2012-05-08 2012-05-09 2012-05-10 2012-05-11 2012-05-12 2012-05-14 2012-05-15 2012-05-16 2012-05-17 2012-05-18 2012-05-19 2012-05-21 2012-05-22 2012-05-23
1 402 492 1078 440 NA 382 576 1105 511 368 NA 360 381 1152 718 353 NA 408 413 935
2012-05-24 2012-05-25 2012-05-26 2012-05-28 2012-05-29 2012-05-30 2012-05-31 2012-06-01 2012-06-02 2012-06-04 2012-06-05 2012-06-06 2012-06-07 2012-06-08 2012-06-09 2012-06-11 2012-06-12 2012-06-13 2012-06-14 2012-06-15
1 306 277 NA 253 367 977 557 432 NA 328 521 467 972 1556 NA 386 1394 401 857 857
2012-06-16 2012-06-18 2012-06-19 2012-06-20 2012-06-21 2012-06-22 2012-06-23 2012-06-25 2012-06-26 2012-06-27 2012-06-28 2012-06-29 2012-06-30 2012-07-02 2012-07-03 2012-07-04 2012-07-05 2012-07-06 2012-07-07 2012-07-09
1 NA 1056 324 329 327 325 NA 341 268 231 245 301 NA 283 365 297 310 260 NA 254
2012-07-10 2012-07-11 2012-07-12 2012-07-13 2012-07-14 2012-07-16 2012-07-17 2012-07-18 2012-07-19 2012-07-20 2012-07-21 2012-07-23 2012-07-24 2012-07-25 2012-07-26 2012-07-27 2012-07-28 2012-07-30 2012-07-31 2012-08-01
1 283 395 273 273 NA 278 243 210 356 267 NA 442 483 271 327 271 NA 716 598 577
2012-08-02 2012-08-03 2012-08-06 2012-08-07 2012-08-08 2012-08-09 2012-08-10 2012-08-13 2012-08-14 2012-08-15 2012-08-16 2012-08-17 2012-08-20 2012-08-21 2012-08-22 2012-08-23 2012-08-24 2012-08-27 2012-08-28 2012-08-29
1 345 403 318 522 333 259 404 244 240 288 245 22 738 530 390 648 294 403 381 724
2012-08-30 2012-08-31 2012-09-03 2012-09-04 2012-09-05 2012-09-06 2012-09-07 2012-09-10 2012-09-11 2012-09-12 2012-09-13 2012-09-14 2012-09-17 2012-09-18 2012-09-19 2012-09-20 2012-09-21 2012-09-24 2012-09-25 2012-09-26
1 740 575 558 785 883 501 901 500 285 174 562 1047 603 990 289 173 253 512 236 278
2012-09-27 2012-09-28 2012-10-01 2012-10-02 2012-10-03 2012-10-04 2012-10-05 2012-10-08 2012-10-09 2012-10-10 2012-10-11
1 173 277 217 291 197 308 124 387 369 250 242
I am trying to calculate each row's standard deviation and assign it to a column named sd:
xx$sd<-rowSds(xx)
I get this error:
Error in apply(na.omit(as.matrix(x), ...), 1, FUN, ...) :
error in evaluating the argument 'X' in selecting a method for function 'apply': Error in na.omit(as.matrix(x), ...) :
error in evaluating the argument 'object' in selecting a method for function 'na.omit': Error in `colnames<-`(`*tmp*`, value = c("2012-02-23", "2012-02-24", "2012-02-25", :
length of 'dimnames' [2] not equal to array extent
Any ideas how I can omit NA when calculating the SD? Is my syntax correct?
First, review how a SD of one group is computed: Calculate the difference between each value and the group mean, square those differences, add them up, and divide by the number of degrees of freedom (df), which equals n-1. That value is the variance. Its square root is the SD.
It helps us compare data sets that have the same mean but a different range. The sample standard deviation formula is $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $\bar{x}$ is the sample mean, $x_i$ are the data observations, and $n$ is the sample size.
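As a minimal sketch of that computation in R, the steps can be written out by hand and compared with the built-in sd(). The vector x below reuses the first five values of the question's row purely for illustration; the names x_obs, x_bar, variance and manual_sd are made up for this example.

```r
# Illustrative values (first five entries of the question's row); NA is a missing day.
x <- c(152, 424, NA, 499, 320)

# Drop missing values first, which is what sd(x, na.rm = TRUE) does.
x_obs <- x[!is.na(x)]

n         <- length(x_obs)                      # sample size
x_bar     <- mean(x_obs)                        # sample mean
variance  <- sum((x_obs - x_bar)^2) / (n - 1)   # squared deviations over df = n - 1
manual_sd <- sqrt(variance)                     # SD is the square root of the variance

# The hand computation should agree with the built-in function.
all.equal(manual_sd, sd(x, na.rm = TRUE))
```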
To calculate the variance, we use the map() method to replace every array item with (value - mean)^2, then sum the array and divide that sum by the length of the array (dividing by the length gives the population variance rather than the n-1 sample variance). To calculate the standard deviation, we take the square root of the variance.
You can use the apply and transform functions:
set.seed(007)
X <- data.frame(matrix(sample(c(10:20, NA), 100, replace=TRUE), ncol=10))
transform(X, SD=apply(X,1, sd, na.rm = TRUE))
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 SD
1 NA 12 17 18 19 16 12 13 20 14 3.041381
2 14 12 13 13 14 18 16 17 20 10 3.020302
3 11 19 NA 12 19 19 19 20 12 20 3.865805
4 10 11 20 12 15 17 18 17 18 12 3.496029
5 12 15 NA 14 20 18 16 11 14 18 2.958040
6 19 11 10 20 13 14 17 16 10 16 3.596294
7 14 16 17 15 10 11 15 15 11 16 2.449490
8 NA 10 15 19 19 12 15 15 19 14 3.201562
9 11 NA NA 20 20 14 14 17 14 19 3.356763
10 15 13 14 15 NA 13 15 NA 15 12 1.195229
From ?apply you can see the ... argument, which allows passing optional arguments to FUN; in this case you can use na.rm=TRUE to omit NA values.
Using rowSds from the matrixStats package also requires setting na.rm=TRUE to omit NA values:
library(matrixStats)
# as.matrix() because recent matrixStats versions expect a matrix, not a data frame
transform(X, SD=rowSds(as.matrix(X), na.rm=TRUE)) # same result as before.
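Unlike X, the question's xx starts with two non-numeric columns (Job and variable). One possible way to handle that, sketched here with base R only, is to restrict the calculation to the numeric columns. The data frame xx_demo is a small hypothetical stand-in for xx, and the threshold of 100 is an arbitrary value chosen for illustration.

```r
# Hypothetical stand-in for xx: two label columns followed by numeric date columns.
xx_demo <- data.frame(
  Job          = c("A", "B", "C"),
  variable     = "Duration",
  `2012-02-23` = c(152, 300, 10),
  `2012-02-24` = c(424, 310, NA),
  `2012-02-25` = c(NA, 305, 900),
  check.names  = FALSE   # keep the date strings as column names
)

# Keep only the numeric columns so that as.matrix() yields a numeric matrix.
num_cols <- sapply(xx_demo, is.numeric)
xx_demo$sd <- apply(as.matrix(xx_demo[, num_cols]), 1, sd, na.rm = TRUE)

# Pick the rows whose SD exceeds the illustrative threshold.
high_sd <- xx_demo[xx_demo$sd > 100, ]
high_sd
```

Selecting columns with is.numeric avoids hard-coding the positions of Job and variable, so the same lines should work if more label columns are added.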