I am trying to run a normality test (shapiro-wilk) on a dataset, and I want the statistic and the p-value for all columns simultaneously. I have read all the other pages on SO (R: Shapiro test by group won't produce p-values and corrupt data frame warning, Using shapiro.test on multiple columns in a data frame) concerning this, but still cant figure it out. Any help would be appreciated!!
Foe eg, here is the dataset: with one character vector (NVL) and rest numeric and i want to group by NVL (NV/VL).
NVL Var1 Var2 Var3 Var 4 Var 5
1. NV 22.5 26.8 89.2 35.7 100
2. NV 34.7 67.4 29.8 12.4 100
3. NV 68.3 34.5 44.5 23.8 100
4. NV 11.2 55.3 17.5 77.9 100
5. VL 55.6 77.2 59.7 89.6 100
6. VL 60.5 88.7 65.4 99.6 100
7. VL 89.4 87.5 65.9 89.5 100
8. VL 65.4 74.2 75.4 89.5 100
9. VL 81.8 78.5 95.4 92.5 100
Here is the code:
library(dplyr)
normalityVar1<-mydata %>%
group_by(NVL) %>%
summarise(statistic = shapiro.test(Var1)$statistic,
p.value = shapiro.test(Var1)$p.value)
Here is the output:
NVL statistic p.value
<chr> <dbl> <dbl>
1 VL 0.9125239 0.1985486
2 NV 0.8983501 0.2101248
Now, do i edit this code, so that I can get this output for all the variables (Var2, 3, 4 ,5) at the same time? I even tried aggregate and sapply, but I am stuck.
aggregate(formula = Var1 ~ NVL,
data = mydata,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)})
As you can see, I am able to do this for only one variable! I know i am close, but i just cant figure it out anymore!! Thanks in advance for any help!!
I suggest constructing a long-format ("tidy") dataset and using the tidiverse
functions.
Load packages:
library(dplyr) # for data manipulation functions
library(tidyr) # for data manipulation functions
library(data.table) # for function `fread`
library(broom) # for function `tidy`
Read data into R
:
data <- fread(
"NVL Var1 Var2 Var3 Var4 Var5
NV 22.5 26.8 89.2 35.7 100
NV 34.7 67.4 29.8 12.4 100
NV 68.3 34.5 44.5 23.8 50
NV 11.2 55.3 17.5 77.9 100
VL 55.6 77.2 59.7 89.6 100
VL 60.5 88.7 65.4 99.6 100
VL 89.4 87.5 65.9 89.5 100
VL 65.4 74.2 75.4 89.5 90
VL 81.8 78.5 95.4 92.5 90")
Do the analysis:
# 1. Gather the variables that values should be tested.
# 2. Group by variable with variable names (`variable_name`) and
# by all group variables (in our case `NVL`).
# 3. Do the test for `value` and tidy the result.
# 4. Ungroup (it's a good practice to do this).
# 5. Remove unnecessary information (column `method`).
sw_test_results <- data %>%
gather(key = "variable_name", value = "value", Var1:Var5) %>%
group_by(variable_name, NVL) %>%
do(tidy(shapiro.test(.$value))) %>%
ungroup() %>%
select(-method)
sw_test_results
The results:
# A tibble: 10 x 4
variable_name NVL statistic p.value
<chr> <chr> <dbl> <dbl>
1 Var1 NV 0.931 0.602
2 Var1 VL 0.915 0.498
3 Var2 NV 0.941 0.660
4 Var2 VL 0.874 0.282
5 Var3 NV 0.910 0.480
6 Var3 VL 0.864 0.245
7 Var4 NV 0.900 0.433
8 Var4 VL 0.726 0.0176
9 Var5 NV 0.630 0.00124
10 Var5 VL 0.684 0.00647
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With