I have a data frame with one grouping factor (the first column) with more than two levels and several columns of data. I want to apply wilcox.test to the whole data frame to compare each variable between every pair of groups. How can I do this?
UPDATE: I know that wilcox.test only tests for a difference between two groups, and my data frame contains three. But I am more interested in how to do this than in which test to use. Most likely one group will be removed, but I have not decided on that yet, so I want to test all variants.
Here is a sample:
structure(list(group = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), var1 = c(9.3,
9.05, 7.78, 7.11, 7.14, 8.12, 7.5, 7.84, 7.8, 7.52, 8.84, 6.98,
6.1, 6.89, 6.5, 7.5, 7.8, 5.5, 6.61, 7.65, 7.68), var2 = c(11L,
11L, 10L, 1L, 3L, 7L, 11L, 11L, 11L, 11L, 4L, 1L, 1L, 1L, 2L,
2L, 1L, 4L, 8L, 8L, 1L), var3 = c(7L, 11L, 3L, 7L, 11L, 2L, 11L,
5L, 11L, 11L, 5L, 11L, 11L, 2L, 9L, 9L, 3L, 8L, 11L, 11L, 2L),
var4 = c(11L, 11L, 11L, 11L, 6L, 11L, 11L, 11L, 10L, 7L,
11L, 2L, 11L, 3L, 11L, 11L, 6L, 11L, 1L, 11L, 11L), var5 = c(11L,
1L, 2L, 2L, 11L, 11L, 1L, 10L, 2L, 11L, 1L, 3L, 11L, 11L,
8L, 8L, 11L, 11L, 11L, 2L, 9L)), .Names = c("group", "var1",
"var2", "var3", "var4", "var5"), class = "data.frame", row.names = c(NA,
-21L))
UPDATE
Thanks to everyone for all answers!
Updating my answer to work across columns
test.fun <- function(dat, col) {
  # All unordered pairs of group levels, one pair per column
  c1 <- combn(unique(dat$group), 2)
  sigs <- list()
  for (i in seq_len(ncol(c1))) {
    sigs[[i]] <- wilcox.test(
      dat[dat$group == c1[1, i], col],
      dat[dat$group == c1[2, i], col]
    )
  }
  names(sigs) <- paste("Group", c1[1, ], "by Group", c1[2, ])
  # Collect the W statistic and p-value of each test into one table
  tests <- data.frame(Test = names(sigs),
                      W = unlist(lapply(sigs, function(x) x$statistic)),
                      p = unlist(lapply(sigs, function(x) x$p.value)),
                      row.names = NULL)
  return(tests)
}
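# dat is the sample data frame from the question; run the tests for every data column
tests <- lapply(colnames(dat)[-1], function(x) test.fun(dat, x))
names(tests) <- colnames(dat)[-1]
# tests <- do.call(rbind, tests)  # collapses the list into a single data.frame
# Timing. Note that rep() evaluates its first argument only once and then
# repeats the result, so the figure below reflects a single pass over all
# columns, not 10000 runs; use replicate(n, expr) to time repeated runs.
system.time(
  rep(
    tests <- lapply(colnames(dat)[-1], function(x) test.fun(dat, x)), 10000
  )
)
#  user  system elapsed
# 0.056   0.000   0.053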
And the result:
tests
$var1
                Test  W          p
1 Group 1 by Group 2 28 0.36596737
2 Group 1 by Group 3 39 0.05927406
3 Group 2 by Group 3 38 0.27073136

$var2
                Test    W         p
1 Group 1 by Group 2 19.0 0.8205958
2 Group 1 by Group 3 36.5 0.1159945
3 Group 2 by Group 3 40.5 0.1522726

$var3
                Test    W         p
1 Group 1 by Group 2 13.0 0.2425786
2 Group 1 by Group 3 23.5 1.0000000
3 Group 2 by Group 3 41.0 0.1261647

$var4
                Test  W         p
1 Group 1 by Group 2 26 0.4323470
2 Group 1 by Group 3 30 0.3729664
3 Group 2 by Group 3 29 0.9479518

$var5
                Test    W         p
1 Group 1 by Group 2 24.0 0.7100968
2 Group 1 by Group 3 19.0 0.5324295
3 Group 2 by Group 3 17.5 0.2306609
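If one long table is easier to work with than a list, the per-variable results can be stacked, which is what the commented-out do.call(rbind, tests) line hints at. Below is a minimal sketch of that idea, assuming the sample data (cut down here to var1 and var2 for brevity) and a compact restatement of test.fun; the names vars and all_tests are made up for illustration.

```r
# Sample data from the question, restricted to var1 and var2 for brevity
dat <- data.frame(
  group = rep(1:3, times = c(6, 7, 8)),
  var1 = c(9.3, 9.05, 7.78, 7.11, 7.14, 8.12,
           7.5, 7.84, 7.8, 7.52, 8.84, 6.98, 6.1,
           6.89, 6.5, 7.5, 7.8, 5.5, 6.61, 7.65, 7.68),
  var2 = c(11, 11, 10, 1, 3, 7,
           11, 11, 11, 11, 4, 1, 1,
           1, 2, 2, 1, 4, 8, 8, 1)
)

# Compact restatement of test.fun: one row per pair of groups.
# wilcox.test() warns when there are ties; the results are still returned.
test.fun <- function(dat, col) {
  pairs <- combn(unique(dat$group), 2)
  res <- apply(pairs, 2, function(g) {
    wt <- wilcox.test(dat[dat$group == g[1], col],
                      dat[dat$group == g[2], col])
    c(W = unname(wt$statistic), p = wt$p.value)
  })
  data.frame(Test = paste("Group", pairs[1, ], "by Group", pairs[2, ]),
             W = res["W", ], p = res["p", ])
}

vars <- colnames(dat)[-1]
tests <- lapply(vars, function(x) test.fun(dat, x))
names(tests) <- vars

# Stack the per-variable tables, tagging each row with its variable name
all_tests <- do.call(
  rbind,
  Map(function(v, tab) cbind(Variable = v, tab), vars, tests)
)
rownames(all_tests) <- NULL
all_tests
```

Adding the Variable column before binding keeps track of which rows belong to which column, which a plain do.call(rbind, tests) would only encode in the row names.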
The pairwise.wilcox.test function seems like it would be useful here; perhaps like this (with d holding the sample data frame)?
out <- lapply(2:6, function(x) pairwise.wilcox.test(d[[x]], d$group))
names(out) <- names(d)[2:6]
out
If you just want the p-values, you can go through and extract those and make a matrix.
sapply(out, function(x) {
  p <- x$p.value
  n <- outer(rownames(p), colnames(p), paste, sep = 'v')
  p <- as.vector(p)
  names(p) <- n
  p
})
##          var1      var2      var3 var4      var5
## 2v1 0.5414627 0.8205958 0.4851572    1 1.0000000
## 3v1 0.1778222 0.3479835 1.0000000    1 1.0000000
## 2v2        NA        NA        NA   NA        NA
## 3v2 0.5414627 0.3479835 0.3784941    1 0.6919826
Also note that pairwise.wilcox.test adjusts for multiple comparisons using the Holm method by default; if you'd rather do something different, look at the p.adjust.method argument (the available methods are listed in ?p.adjust).
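To make the adjustment concrete, here is a small sketch that runs the same pairwise test on var1 with three settings of p.adjust.method. It assumes the sample data is stored in d, rebuilt here with only group and var1; the names holm, raw, bonf and p_raw are just for illustration.

```r
# Sample data from the question (group and var1 only)
d <- data.frame(
  group = rep(1:3, times = c(6, 7, 8)),
  var1 = c(9.3, 9.05, 7.78, 7.11, 7.14, 8.12,
           7.5, 7.84, 7.8, 7.52, 8.84, 6.98, 6.1,
           6.89, 6.5, 7.5, 7.8, 5.5, 6.61, 7.65, 7.68)
)

# Default behaviour: Holm adjustment
holm <- pairwise.wilcox.test(d$var1, d$group)

# No adjustment: the raw p-values of the individual wilcox.test calls
raw <- pairwise.wilcox.test(d$var1, d$group, p.adjust.method = "none")

# Bonferroni adjustment instead of Holm
bonf <- pairwise.wilcox.test(d$var1, d$group, p.adjust.method = "bonferroni")

# Adjusting the raw p-values by hand with p.adjust() reproduces the Holm values
p_raw <- raw$p.value[!is.na(raw$p.value)]
p.adjust(p_raw, method = "holm")
```

With p.adjust.method = "none" the p-values should agree with those from calling wilcox.test on each pair of groups directly, which makes it easy to compare the two answers.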