Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Shapiro.test in dplyr on multiple columns at same time

Tags:

r

dplyr

I am trying to run a normality test (shapiro-wilk) on a dataset, and I want the statistic and the p-value for all columns simultaneously. I have read all the other pages on SO (R: Shapiro test by group won't produce p-values and corrupt data frame warning, Using shapiro.test on multiple columns in a data frame) concerning this, but still cant figure it out. Any help would be appreciated!!

Foe eg, here is the dataset: with one character vector (NVL) and rest numeric and i want to group by NVL (NV/VL).

     NVL  Var1  Var2  Var3  Var 4  Var 5
1.   NV   22.5  26.8   89.2  35.7   100
2.   NV   34.7  67.4   29.8  12.4   100
3.   NV   68.3  34.5   44.5  23.8   100
4.   NV   11.2  55.3   17.5  77.9   100
5.   VL   55.6  77.2   59.7  89.6   100
6.   VL   60.5  88.7   65.4  99.6   100
7.   VL   89.4  87.5   65.9  89.5   100
8.   VL   65.4  74.2   75.4  89.5   100
9.   VL   81.8  78.5   95.4  92.5   100

Here is the code:

library(dplyr)
normalityVar1<-mydata %>%
group_by(NVL) %>%
summarise(statistic = shapiro.test(Var1)$statistic, 
p.value = shapiro.test(Var1)$p.value)

Here is the output:

NVL statistic   p.value
  <chr>     <dbl>     <dbl>
1    VL 0.9125239 0.1985486
2    NV 0.8983501 0.2101248

Now, do i edit this code, so that I can get this output for all the variables (Var2, 3, 4 ,5) at the same time? I even tried aggregate and sapply, but I am stuck.

aggregate(formula = Var1 ~ NVL,
data = mydata,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)}) 

As you can see, I am able to do this for only one variable! I know i am close, but i just cant figure it out anymore!! Thanks in advance for any help!!

like image 566
Shades Avatar asked Jan 30 '23 08:01

Shades


1 Answers

I suggest constructing a long-format ("tidy") dataset and using the tidiverse functions.

Load packages:

library(dplyr)      # for data manipulation functions
library(tidyr)      # for data manipulation functions
library(data.table) # for function `fread`
library(broom)      # for function `tidy`

Read data into R:

data <- fread(
"NVL   Var1  Var2   Var3  Var4   Var5
  NV   22.5  26.8   89.2  35.7   100
  NV   34.7  67.4   29.8  12.4   100
  NV   68.3  34.5   44.5  23.8   50
  NV   11.2  55.3   17.5  77.9   100
  VL   55.6  77.2   59.7  89.6   100
  VL   60.5  88.7   65.4  99.6   100
  VL   89.4  87.5   65.9  89.5   100
  VL   65.4  74.2   75.4  89.5   90
  VL   81.8  78.5   95.4  92.5   90")

Do the analysis:

# 1. Gather the variables that values should be tested.
# 2. Group by variable with variable names (`variable_name`) and 
#    by all group variables (in our case `NVL`).
# 3. Do the test for `value` and tidy the result.
# 4. Ungroup (it's a good practice to do this). 
# 5. Remove unnecessary information (column `method`).

sw_test_results <- data %>% 
    gather(key = "variable_name", value = "value", Var1:Var5) %>% 
    group_by(variable_name, NVL)  %>% 
    do(tidy(shapiro.test(.$value))) %>% 
    ungroup() %>% 
    select(-method)

sw_test_results

The results:

# A tibble: 10 x 4
   variable_name NVL   statistic p.value
   <chr>         <chr>     <dbl>   <dbl>
 1 Var1          NV        0.931 0.602  
 2 Var1          VL        0.915 0.498  
 3 Var2          NV        0.941 0.660  
 4 Var2          VL        0.874 0.282  
 5 Var3          NV        0.910 0.480  
 6 Var3          VL        0.864 0.245  
 7 Var4          NV        0.900 0.433  
 8 Var4          VL        0.726 0.0176 
 9 Var5          NV        0.630 0.00124
10 Var5          VL        0.684 0.00647
like image 193
GegznaV Avatar answered Feb 01 '23 08:02

GegznaV