Consider a dataset Data which has several factor variables and several continuous numerical variables. Some of these variables, say slice_by_1 (with classes "Male", "Female") and slice_by_2 (with classes "Sad", "Neutral", "Happy"), are used to 'slice' the data into subsets. For every subset, a Kruskal-Wallis test should be run on the variables length, preasure, and pulse, each grouped by another factor variable called compare_by. Is there a quick way in R to accomplish this task and put the calculated p-values into a matrix?
I used the dplyr package to prepare the data.
Sample dataset:
library(dplyr)
set.seed(123)
# tbl_df() is deprecated in dplyr >= 1.0.0; as_tibble() is the current equivalent
Data <- tbl_df(
  data.frame(
    slice_by_1 = as.factor(rep(c("Male", "Female"), times = 120)),
    slice_by_2 = as.factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
    compare_by = as.factor(rep(c("blue", "green", "brown"), times = 80)),
    length     = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
    pulse      = runif(240, 60, 120),
    preasure   = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
  )
) %>%
  group_by(slice_by_1, slice_by_2)
Let's look at the data:
Source: local data frame [240 x 6]
Groups: slice_by_1, slice_by_2
   slice_by_1 slice_by_2 compare_by length     pulse     preasure
1        Male      Happy       blue     10  69.23376  0.508694601
2      Female      Happy      green      1  68.57866 -1.155632020
3        Male      Happy      brown      8 112.72132  0.007031799
4      Female      Happy       blue      3 116.61283  0.383769524
5        Male      Happy      green      7 110.06851 -0.717791526
6      Female      Happy      brown      8 117.62481  2.938658488
7        Male      Happy       blue      9 105.59749  0.735831389
8      Female      Happy      green      2  83.44101  3.881268679
9        Male      Happy      brown      5 101.48334  0.025572561
10     Female      Happy       blue     10  62.87331 -0.715108893
..        ...        ...        ...    ...       ...          ...
An example of desired output:
    Data_subsets    length  preasure     pulse
1     Male_Happy <p-value> <p-value> <p-value>
2   Female_Happy <p-value> <p-value> <p-value>
3   Male_Neutral <p-value> <p-value> <p-value>
4 Female_Neutral <p-value> <p-value> <p-value>
5       Male_Sad <p-value> <p-value> <p-value>
6     Female_Sad <p-value> <p-value> <p-value>
For each permutation ω of the group labels, compute the value of the Kruskal-Wallis statistic, say h(ω). Then count how many times h(ω) is greater than or equal to h0, the statistic for the observed data. Divide that count by the total number of permutations and you get the p-value.
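The permutation recipe above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the function name perm_kw_pvalue and its n_perm argument are hypothetical, and the code draws random permutations of the labels (a Monte Carlo approximation) rather than enumerating all of them.

```r
# Monte Carlo permutation p-value for the Kruskal-Wallis statistic.
# perm_kw_pvalue and n_perm are hypothetical names used for illustration.
perm_kw_pvalue <- function(x, g, n_perm = 999) {
  g <- as.factor(g)
  # h0: statistic computed with the observed group labels
  h0 <- unname(kruskal.test(x, g)$statistic)
  # h(omega): statistic after randomly shuffling the group labels
  h_perm <- replicate(n_perm, unname(kruskal.test(x, sample(g))$statistic))
  # proportion of shuffles whose statistic is at least as large as h0
  mean(h_perm >= h0)
}

# Toy data (made up for this sketch): three groups with shifted means
set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 0.5), rnorm(20, mean = 1))
g <- rep(c("a", "b", "c"), each = 20)
p_perm <- perm_kw_pvalue(x, g, n_perm = 499)
p_perm
```

With enough permutations the result should be close to the chi-squared p-value that kruskal.test() reports, but the two are not expected to match exactly.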
Typically, a Kruskal-Wallis H test is used when you have three or more categorical, independent groups. It can also be used for just two groups, although a Mann-Whitney U test is more commonly used in that case.
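The two-group case can be checked directly in R. The sketch below uses made-up data and assumes the Mann-Whitney test is run with its normal approximation and without continuity correction (exact = FALSE, correct = FALSE); under those settings the two p-values should agree.

```r
# Two groups: Kruskal-Wallis vs. Mann-Whitney (Wilcoxon rank-sum) p-values.
set.seed(42)
x <- rnorm(15, mean = 0)   # made-up sample for group "x"
y <- rnorm(18, mean = 1)   # made-up sample for group "y"

values <- c(x, y)
groups <- factor(rep(c("x", "y"), times = c(15, 18)))

p_kw <- kruskal.test(values, groups)$p.value
# Normal approximation without continuity correction, to match Kruskal-Wallis
p_mw <- wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value

all.equal(p_kw, p_mw)
```

With the default wilcox.test() settings (exact p-value for small samples, or continuity correction) the numbers will generally differ slightly.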
The most common use of the Kruskal–Wallis test is when you have one nominal variable and one measurement variable, an experiment that you would usually analyze using one-way anova, but the measurement variable does not meet the normality assumption of a one-way anova.
You have most of it with the group_by; now you just need to do it:
Data %>%
  do({
    # one row of p-values per (slice_by_1, slice_by_2) group
    data.frame(
      Data_subsets = paste(.$slice_by_1[[1]], .$slice_by_2[[1]], sep = '_'),
      length       = kruskal.test(.$length, .$compare_by)$p.value,
      preasure     = kruskal.test(.$preasure, .$compare_by)$p.value,
      pulse        = kruskal.test(.$pulse, .$compare_by)$p.value,
      stringsAsFactors = FALSE)
  }) %>%
  ungroup() %>%                    # grouping columns cannot be dropped while grouped
  select(-starts_with("slice_"))
## Source: local data frame [6 x 4]
##     Data_subsets    length  preasure     pulse
## 1   Female_Happy 0.4369918 0.1937327 0.8767561
## 2 Female_Neutral 0.3750688 0.8588069 0.2858796
## 3     Female_Sad 0.7958502 0.6274940 0.5801208
## 4     Male_Happy 0.3099704 0.6929493 0.3796494
## 5   Male_Neutral 0.4953853 0.2986860 0.2418708
## 6       Male_Sad 0.7159970 0.8528201 0.5686672
You have to call ungroup() before select() to remove the slice_* columns, since group_by columns aren't dropped while the data is still grouped (I'd like to say "never dropped", but I am not certain of that).
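Since the question asks for a matrix, here is a minimal base R sketch that builds one directly. It is an alternative to the do() approach rather than a rewrite of it: the names vars, subset_id and p_matrix are made up for this example, and the data frame is the question's sample data rebuilt without dplyr.

```r
# Base R sketch: p-value matrix, one row per subset, one column per variable.
set.seed(123)
Data <- data.frame(
  slice_by_1 = factor(rep(c("Male", "Female"), times = 120)),
  slice_by_2 = factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
  compare_by = factor(rep(c("blue", "green", "brown"), times = 80)),
  length     = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
  pulse      = runif(240, 60, 120),
  preasure   = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
)

vars <- c("length", "preasure", "pulse")
# Combine the two slicing factors into labels such as "Male_Happy"
subset_id <- interaction(Data$slice_by_1, Data$slice_by_2, sep = "_")

# For each subset, run kruskal.test() on every variable against compare_by
p_matrix <- t(sapply(split(Data, subset_id), function(d) {
  sapply(vars, function(v) kruskal.test(d[[v]], d$compare_by)$p.value)
}))
p_matrix
```

The result is a plain numeric matrix with the subset labels as row names, so it can be indexed as p_matrix["Male_Happy", "pulse"].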