Using shapiro.test on multiple columns in a data frame

Q: What is the minimum sample size for Shapiro-Wilk test?

Testing normality using SPSS: SPSS provides the K-S (with Lilliefors correction) and the Shapiro-Wilk normality tests and recommends these tests only for a sample size of less than 50 (8).
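In R, shapiro.test() enforces its own limits: according to its documentation, the number of non-missing values must be between 3 and 5000. A minimal sketch of what happens on either side of those limits (the helper name runs_ok is hypothetical, introduced only for this illustration):

```r
# Returns TRUE if shapiro.test() runs on x, FALSE if it raises an error
runs_ok <- function(x) {
  tryCatch({ shapiro.test(x); TRUE }, error = function(e) FALSE)
}

set.seed(1)
runs_ok(rnorm(2))     # below the documented minimum of 3
runs_ok(rnorm(3))     # smallest accepted sample
runs_ok(rnorm(5001))  # above the documented maximum of 5000
```

Being accepted by the function does not mean the test has useful power at that sample size; that is a separate question.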

Q: Do outliers affect Shapiro-Wilk test?

The Shapiro-Wilk test is not especially sensitive to outliers. There are normality tests that focus on outliers, by looking at a combination of skewness and kurtosis, but they are different.

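Q: Is the Shapiro-Wilk test one-sided or two-sided?

The Shapiro-Wilk test checks whether a normal distribution fits the observations. It is usually the most powerful test for normality. The test is one-sided: only small values of W count as evidence against normality, which corresponds to a right-tailed test on the transformed statistic used to compute the p-value.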

Tags:

function

dataframe

r

statistics

I have a data frame (let's call it df) containing n=100 columns (C1, C2, ..., C100) and 50 rows (R1, R2, ..., R50). I tested all the columns in the data frame to be sure they are numeric. I want to know whether the data in each column are normally distributed, using the shapiro.test() function.

I am able to do it column by column using the code:

> shapiro.test(df$Cn)

> shapiro.test(df[,c(Cn)])

However, when I try to do it on several columns at the same time, it doesn't work:

> shapiro.test(df[,c(C1:C100)])

returns the error:

Error in [.data.frame(x, complete.cases(x)) : undefined columns selected

I would appreciate it if anyone could suggest a way to do all the tests at the same time, and eventually store the results in a new dataframe/matrix/list/vector.

992

asked Jan 20 '14 16:01

Seb Matamoros

2 Answers

Not that I think this is a sensible approach to data analysis, but the underlying issue of applying a function to the columns of a data frame is a general task that can easily be achieved using one of sapply() or lapply() (or even apply(), but for data frames, one of the two earlier-mentioned functions would be best).

Here is an example, using some dummy data:

set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2), 
                 Uniform = runif(50))

Now apply the shapiro.test() function. We capture the output in a list (given the object returned by this function) so we will use lapply().

lshap <- lapply(df, shapiro.test)
lshap[[1]] ## look at the first column results

R> lshap[[1]]

    Shapiro-Wilk normality test

data:  X[[1L]]
W = 0.9802, p-value = 0.5611
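To run the test on only some of the columns, one option is to subset the data frame by column name before calling lapply(). The sketch below assumes columns named C1 to C100 as in the question; the data are simulated purely for illustration, and the object names df_q, cols and lshap_sub are made up here.

```r
set.seed(1)
# Simulated stand-in for the question's data: 50 rows, columns C1..C100
df_q <- as.data.frame(matrix(rnorm(50 * 100), nrow = 50))
names(df_q) <- paste0("C", 1:100)

# Build the column names as strings; a bare C1:C10 inside `[` would
# look for objects called C1 and C10 rather than for columns
cols <- paste0("C", 1:10)

# df_q[cols] is still a data frame (a list of columns), so lapply()
# hands shapiro.test() one numeric vector at a time
lshap_sub <- lapply(df_q[cols], shapiro.test)
names(lshap_sub)
```

The key point is that shapiro.test() expects a single numeric vector, so a multi-column data frame has to be split into columns before the test is called.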

You will need to extract the things you want from these objects, which all have the structure:

R> str(lshap[[1]])
List of 4
 $ statistic: Named num 0.98
  ..- attr(*, "names")= chr "W"
 $ p.value  : num 0.561
 $ method   : chr "Shapiro-Wilk normality test"
 $ data.name: chr "X[[1L]]"
 - attr(*, "class")= chr "htest"

If you want the statistic and p.value components of this object for all elements of lshap, we will use sapply() this time, to nicely arrange the results for us:

lres <- sapply(lshap, `[`, c("statistic","p.value"))

R> lres
          Gaussian Poisson Uniform 
statistic 0.9802   0.9371  0.918   
p.value   0.5611   0.01034 0.001998

Given that you have 100 of these, I'd transpose lres:

R> t(lres)
         statistic p.value 
Gaussian 0.9802    0.5611  
Poisson  0.9371    0.01034 
Uniform  0.918     0.001998
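Note that lres is a matrix whose cells are list elements rather than plain numbers. If a regular data frame with numeric columns is more convenient, something along the following lines could be used. This is only a sketch; the names shap_df, column and W are chosen here for illustration.

```r
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))
lshap <- lapply(df, shapiro.test)

# vapply() enforces that each extraction returns exactly one number
shap_df <- data.frame(
  column  = names(lshap),
  W       = vapply(lshap, function(h) unname(h$statistic), numeric(1)),
  p.value = vapply(lshap, function(h) h$p.value, numeric(1)),
  row.names = NULL
)
shap_df
```

With numeric columns, the result can be sorted or filtered directly, for example shap_df[order(shap_df$p.value), ].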

If you plan on doing anything with the p-values from this exercise, I suggest you start thinking about how to correct for multiple comparisons before you shoot yourself in the foot with a 30-cal.
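One way to act on this advice is base R's p.adjust(), which implements several standard corrections. The sketch below uses Holm's method purely as an example; which correction is appropriate depends on the analysis, and the 0.05 threshold is an assumption made for illustration.

```r
set.seed(42)
df <- data.frame(Gaussian = rnorm(50), Poisson = rpois(50, 2),
                 Uniform = runif(50))

# Raw p-values, one per column
p_raw <- vapply(df, function(x) shapiro.test(x)$p.value, numeric(1))

# Holm's step-down correction controls the family-wise error rate
p_holm <- p.adjust(p_raw, method = "holm")

# Columns for which normality is rejected at the (assumed) 5% level
names(p_holm)[p_holm < 0.05]
```

Other choices for the method argument include "bonferroni" and "BH"; see ?p.adjust for the full list.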

148

answered Sep 23 '22 16:09

Gavin Simpson

Use do.call with rbind and lapply for a simpler, more compact solution:

df <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
do.call(rbind, lapply(df, function(x) shapiro.test(x)[c("statistic", "p.value")]))
#>   statistic p.value    
#> a 0.986224  0.3875904  
#> b 0.9894938 0.6238027
#> c 0.9652532 0.009694794

answered Sep 23 '22 16:09

Artem Klevtsov
