
Standard Deviation coming up NA when using summarise() function

I am trying to calculate descriptive statistics for the birthweight data set (birthwt, from the MASS package). However, I'm only interested in a few variables: age, ftv, ptl and lwt.

This is the code I have so far:

library(MASS)
library(dplyr)
data("birthwt")

grouped <- group_by(birthwt, age, ftv, ptl, lwt)

summarise(grouped, 
          mean = mean(bwt),
          median = median(bwt),
          SD = sd(bwt))

It gives me a pretty-printed table, but only a few of the SD values are filled in and the rest are NA. I just can't work out why or how to fix it!

Angus asked Jan 04 '18 03:01



2 Answers

I ended up here for a different reason, and in my case the answer also comes from the docs (the examples for summarise()):

# BEWARE: reusing variables may lead to unexpected results
mtcars %>%
    group_by(cyl) %>%
    summarise(disp = mean(disp), sd = sd(disp))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#>     cyl  disp    sd
#>   <dbl> <dbl> <dbl>
#> 1     4  105.    NA
#> 2     6  183.    NA
#> 3     8  353.    NA

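Here disp is overwritten with the per-group mean, so sd(disp) is computed on a single value and returns NA. If you have the same problem, create new variables instead of reusing an existing name: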

mtcars %>%
    group_by(cyl) %>%
    summarise(
        disp_mean = mean(disp),
        disp_sd = sd(disp)
    )
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 3
#>     cyl disp_mean disp_sd
#>   <dbl>     <dbl>   <dbl>
#> 1     4      105.    26.9
#> 2     6      183.    41.6
#> 3     8      353.    67.8
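
If the mean has to keep the original column name, one alternative is to compute the standard deviation first, because summarise() evaluates its arguments in order. A minimal sketch, assuming dplyr is installed; the name by_cyl is only for illustration:

```r
library(dplyr)

# sd() runs first, while `disp` still refers to the original column;
# only afterwards is `disp` replaced by the per-group mean.
by_cyl <- mtcars %>%
    group_by(cyl) %>%
    summarise(sd = sd(disp), disp = mean(disp))

by_cyl
```

Using distinct names, as above, is still the safer choice because it does not depend on the order of the arguments.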

teppo answered Nov 04 '22 01:11


Some of the groups contain only one row:

grouped %>% 
     summarise(n = n())
# A tibble: 179 x 5
# Groups: age, ftv, ptl [?]
#     age   ftv   ptl   lwt     n
#   <int> <int> <int> <int> <int>
# 1    14     0     0   135     1
# 2    14     0     1   101     1
# 3    14     2     0   100     1
# 4    15     0     0    98     1
# 5    15     0     0   110     1
# 6    15     0     0   115     1
# 7    16     0     0   110     1
# 8    16     0     0   112     1
# 9    16     0     0   135     2
#10    16     1     0    95     1

According to ?sd,

The standard deviation of a length-one vector is NA.

As a result, sd is NA for every group that contains only one row.
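
The behaviour is easy to check, and there are two possible ways around it, sketched below. This is only a sketch under an assumption about what the question is after: that the statistics are wanted for age, ftv, ptl and lwt themselves, or for bwt split by a single variable. The names overall and by_ptl are made up for illustration, and across() needs dplyr 1.0.0 or later.

```r
library(MASS)
library(dplyr)
data("birthwt")

# A single value has no spread to estimate, so sd() returns NA.
sd(5)

# Assumption: the goal is descriptive statistics *of* age, ftv, ptl and lwt,
# not of bwt broken down by every combination of them.
overall <- birthwt %>%
    summarise(across(c(age, ftv, ptl, lwt),
                     list(mean = mean, median = median, sd = sd)))

# Alternative: keep bwt as the outcome, group by one variable only, and
# report the group size so that any one-row group (NA sd) is easy to spot.
by_ptl <- birthwt %>%
    group_by(ptl) %>%
    summarise(n = n(), mean = mean(bwt), median = median(bwt), sd = sd(bwt))

overall
by_ptl
```

Grouping by fewer variables gives larger groups, so fewer of them end up with a single row and an NA standard deviation.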

akrun answered Nov 04 '22 00:11