I have a dataframe (let's call it df), containing n=100 columns (C1, C2, ..., C100) and 50 rows (R1, R2, ..., R50). I have checked that all the columns in the data frame are numeric. I want to know whether the data in each column follow a normal distribution, using the shapiro.test() function.
I am able to do it column by column using the code:
> shapiro.test(df$Cn)
or
> shapiro.test(df[,c(Cn)])
However, when I try to run it on several columns at the same time, it doesn't work:
> shapiro.test(df[,c(C1:C100)])
returns the error:
Error in `[.data.frame`(x, complete.cases(x)) : undefined columns selected
I would appreciate it if anyone could suggest a way to run all the tests at the same time and, ideally, store the results in a new dataframe/matrix/list/vector.
The Shapiro–Wilk test is the more appropriate method for small sample sizes (< 50 samples), although it can also handle larger samples, while the Kolmogorov–Smirnov test is used for n ≥ 50. For both of these tests, the null hypothesis states that the data are drawn from a normally distributed population.
Testing Normality Using SPSS: SPSS provides the K-S test (with Lilliefors correction) and the Shapiro–Wilk normality test, and recommends these tests only for a sample size of less than 50 (8).
The Shapiro–Wilk test is not especially sensitive to outliers. There are normality tests that focus on outliers by looking at a combination of skewness and kurtosis, but they are different tests.
The Shapiro–Wilk test checks whether the normal distribution model fits the observations. It is usually the most powerful test for normality. The test is right-tailed only.
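As a hedged illustration of the contrast above (both tests share the null hypothesis of normality, so a small p-value is evidence against it; shapiro.test() and ks.test() are both in base R's stats package):

```r
set.seed(1)
x <- rnorm(40)   # small sample: Shapiro-Wilk is the usual choice
shapiro.test(x)  # large p-value: no evidence against normality

# Base R's one-sample K-S test requires fully specified parameters.
# Estimating mean and sd from the same data, as below, makes the test
# anti-conservative; that is the problem the Lilliefors correction
# (available outside base R) is designed to address.
y <- rnorm(200)
ks.test(y, "pnorm", mean = mean(y), sd = sd(y))
```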
Not that I think this is a sensible approach to data analysis, but the underlying issue of applying a function to the columns of a data frame is a general task that can easily be achieved using one of sapply() or lapply() (or even apply(), but for data frames, one of the two earlier-mentioned functions would be best).
Here is an example, using some dummy data:
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))
Now apply the shapiro.test() function. We capture the output in a list (given the object returned by this function), so we will use lapply().
lshap <- lapply(df, shapiro.test)
lshap[[1]] ## look at the first column results
R> lshap[[1]]
Shapiro-Wilk normality test
data: X[[1L]]
W = 0.9802, p-value = 0.5611
You will need to extract the things you want from these objects, which all have the structure:
R> str(lshap[[1]])
List of 4
$ statistic: Named num 0.98
..- attr(*, "names")= chr "W"
$ p.value : num 0.561
$ method : chr "Shapiro-Wilk normality test"
$ data.name: chr "X[[1L]]"
- attr(*, "class")= chr "htest"
If you want the statistic and p.value components of this object for all elements of lshap, we will use sapply() this time, to nicely arrange the results for us:
lres <- sapply(lshap, `[`, c("statistic","p.value"))
R> lres
Gaussian Poisson Uniform
statistic 0.9802 0.9371 0.918
p.value 0.5611 0.01034 0.001998
Given that you have 100 of these, I'd transpose lres:
R> t(lres)
statistic p.value
Gaussian 0.9802 0.5611
Poisson 0.9371 0.01034
Uniform 0.918 0.001998
If you plan on doing anything with the p-values from this exercise, I suggest you start thinking about how to correct for multiple comparisons before you shoot yourself in the foot with a 30-cal.
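For instance, a minimal sketch of such a correction, assuming the lres matrix built with sapply() above (p.adjust() is base R; the choice of method = "holm" here is just one option):

```r
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))
lshap <- lapply(df, shapiro.test)
lres <- sapply(lshap, `[`, c("statistic", "p.value"))

# Holm controls the family-wise error rate; method = "BH" would control
# the false-discovery rate instead. Adjusted p-values are never smaller
# than the raw ones.
p_adj <- p.adjust(unlist(lres["p.value", ]), method = "holm")
p_adj
```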
Use do.call with rbind and lapply for a simpler, more compact solution:
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
do.call(rbind, lapply(df, function(x) shapiro.test(x)[c("statistic", "p.value")]))
#> statistic p.value
#> a 0.986224 0.3875904
#> b 0.9894938 0.6238027
#> c 0.9652532 0.009694794
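One caveat worth noting (an observation, not part of the original answer): because each shapiro.test(x)[c("statistic", "p.value")] keeps its list wrapper, rbind() here produces a matrix whose cells are one-element lists. A hedged sketch of flattening it into a plain numeric data frame:

```r
set.seed(1)
df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
res <- do.call(rbind, lapply(df, function(x) shapiro.test(x)[c("statistic", "p.value")]))

# res is a list-matrix; unlist() each column to obtain numeric vectors.
res_df <- data.frame(statistic = unlist(res[, "statistic"]),
                     p.value   = unlist(res[, "p.value"]),
                     row.names = rownames(res))
res_df
```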