
Kruskal-Wallis p-value matrix for data subsets with R

Consider a dataset Data that has several factor variables and several continuous numeric variables. Some of the factors, say slice_by_1 (with classes "Male", "Female") and slice_by_2 (with classes "Sad", "Neutral", "Happy"), are used to 'slice' the data into subsets. For every subset, a Kruskal-Wallis test should be run on each of the variables length, preasure and pulse, grouped by another factor variable called compare_by. Is there a quick way in R to accomplish this and put the calculated p-values into a matrix?

I used the dplyr package to prepare the data.

Sample dataset:

library(dplyr)
set.seed(123)  # make the random sample reproducible

# as_tibble() replaces tbl_df(), which is deprecated in current dplyr
Data <- as_tibble(
    data.frame(
        slice_by_1 = as.factor(rep(c("Male", "Female"), times = 120)),
        slice_by_2 = as.factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
        compare_by = as.factor(rep(c("blue", "green", "brown"), times = 80)),
        length   = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
        pulse    = runif(240, 60, 120),
        preasure = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
    )
) %>%
    # the two slicing factors define the subsets
    group_by(slice_by_1, slice_by_2)

Let's look at the data:

Source: local data frame [240 x 6]
Groups: slice_by_1, slice_by_2

   slice_by_1 slice_by_2 compare_by length     pulse     preasure
1        Male      Happy       blue     10  69.23376  0.508694601
2      Female      Happy      green      1  68.57866 -1.155632020
3        Male      Happy      brown      8 112.72132  0.007031799
4      Female      Happy       blue      3 116.61283  0.383769524
5        Male      Happy      green      7 110.06851 -0.717791526
6      Female      Happy      brown      8 117.62481  2.938658488
7        Male      Happy       blue      9 105.59749  0.735831389
8      Female      Happy      green      2  83.44101  3.881268679
9        Male      Happy      brown      5 101.48334  0.025572561
10     Female      Happy       blue     10  62.87331 -0.715108893
..        ...        ...        ...    ...       ...          ...

An example of desired output:

    Data_subsets    length  preasure     pulse
1     Male_Happy <p-value> <p-value> <p-value>
2   Female_Happy <p-value> <p-value> <p-value>
3   Male_Neutral <p-value> <p-value> <p-value>
4 Female_Neutral <p-value> <p-value> <p-value>
5       Male_Sad <p-value> <p-value> <p-value>
6     Female_Sad <p-value> <p-value> <p-value>
GegznaV asked Aug 28 '15 at 23:08



1 Answer

You have most of it with the group_by(); now you just need do() to run the tests on each group:

Data %>%
    do({
        data.frame(
            Data_subsets=paste(.$slice_by_1[[1]], .$slice_by_2[[1]], sep='_'),
            length=kruskal.test(.$length, .$compare_by)$p.value,
            preasure=kruskal.test(.$preasure, .$compare_by)$p.value,
            pulse=kruskal.test(.$pulse, .$compare_by)$p.value,
            stringsAsFactors=FALSE)
    }) %>%
    ungroup() %>%
    select(-starts_with("slice_"))
## Source: local data frame [6 x 4]
##     Data_subsets    length  preasure     pulse
## 1   Female_Happy 0.4369918 0.1937327 0.8767561
## 2 Female_Neutral 0.3750688 0.8588069 0.2858796
## 3     Female_Sad 0.7958502 0.6274940 0.5801208
## 4     Male_Happy 0.3099704 0.6929493 0.3796494
## 5   Male_Neutral 0.4953853 0.2986860 0.2418708
## 6       Male_Sad 0.7159970 0.8528201 0.5686672

The ungroup() is needed before select() can remove the slice_* columns, since select() on a grouped data frame keeps the group_by columns (I'd like to say they are "never dropped", but I am not certain of that).
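Since do() is superseded in dplyr 1.0 and later, the same per-group computation can also be sketched with summarise() and across(). This is only an illustration, assuming dplyr >= 1.0.0; the names p_values and p_matrix are made up for the example, and the sample data is rebuilt as a plain data frame so the block runs on its own. The last two lines convert the result into the matrix the question asks for.

```r
library(dplyr)

# Rebuild the sample data from the question (plain data frame, ungrouped)
set.seed(123)
Data <- data.frame(
    slice_by_1 = as.factor(rep(c("Male", "Female"), times = 120)),
    slice_by_2 = as.factor(rep(c("Happy", "Neutral", "Sad"), each = 80)),
    compare_by = as.factor(rep(c("blue", "green", "brown"), times = 80)),
    length   = c(sample(1:10, 120, replace = TRUE), sample(5:12, 120, replace = TRUE)),
    pulse    = runif(240, 60, 120),
    preasure = c(rnorm(80, 1, 2), rnorm(80, 1, 2.1), rnorm(80, 1, 3))
)

# p_values is a hypothetical name: one row per subset, one column per variable
p_values <- Data %>%
    group_by(slice_by_1, slice_by_2) %>%
    summarise(
        # run the test on each listed column; compare_by is taken from the current group
        across(c(length, preasure, pulse),
               ~ kruskal.test(.x, compare_by)$p.value),
        .groups = "drop"   # return an ungrouped result
    ) %>%
    mutate(Data_subsets = paste(slice_by_1, slice_by_2, sep = "_"), .before = 1) %>%
    select(-starts_with("slice_"))

# Optional: a numeric matrix with the subset labels as row names
p_matrix <- as.matrix(p_values[, -1])
rownames(p_matrix) <- p_values$Data_subsets
```

Because .groups = "drop" already returns an ungrouped result, no separate ungroup() is needed before select() here.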

r2evans answered Nov 15 '22 at 05:11