
Using R & dplyr to summarize - group_by, count, mean, sd [closed]

Tags: r, dplyr, summarize

Good day and greetings! This is my first post on Stack Overflow. I am fairly new to R and even newer to dplyr. I have a small data set with 2 columns, var1 and var2. The var1 column holds numeric values. The var2 column is a factor with 3 levels: A, B, and C.

        var1 var2
1  1.4395244    A
2  1.7698225    A
3  3.5587083    A
4  2.0705084    A
5  2.1292877    A
6  3.7150650    B
7  2.4609162    B
8  0.7349388    B
9  1.3131471    B
10 1.5543380    B
11 3.2240818    C
12 2.3598138    C
13 2.4007715    C
14 2.1106827    C
15 1.4441589    C

'data.frame':   15 obs. of  2 variables:
 $ var1: num  1.44 1.77 3.56 2.07 2.13 ...
 $ var2: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 2 2 2 2 2 ...

I am trying to use dplyr to group by var2 (A, B, and C), then count the rows and summarize var1 by mean and sd. The count works, but rather than the mean and sd for each group, I receive the overall mean and sd next to each group.

To try to resolve the issue, I have conducted multiple internet searches. All results seem to offer a similar syntax to the one I am using. I have also read through all of the recommended posts that Stack Overflow offered prior to posting. Also, I tried restarting R and I made sure that I am not using plyr.

Here is the code that I used to create the data set and the dplyr group_by / summarize.

library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean=2, sd=1)
var2 <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
          "C", "C", "C", "C", "C")
df <- data.frame(var1, var2)
df

df %>%
  group_by(df$var2) %>%
  summarize(
    count = n(),
    mean = mean(df$var1, na.rm = TRUE),
    sd = sd(df$var1, na.rm = TRUE)
  )

Here are the results:

# A tibble: 3 x 4
  `df$var2` count  mean    sd
  <fct>     <int> <dbl> <dbl>
1 A             5  2.15 0.845
2 B             5  2.15 0.845
3 C             5  2.15 0.845

The count appears to work, showing 5 for each group. However, each group shows the overall mean and sd for the whole column rather than its own. The expected results are the count, mean, and sd for each group.

I am sure I am overlooking something obvious but I would greatly appreciate any assistance.

Thanks!

asked Jul 25 '19 04:07 by earlev4


1 Answer

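Even though this was answered in the comments, I felt such a nice reproducible example for a very first question deserved an official answer. The fix is to use the bare column names (var2 and var1) instead of df$var2 and df$var1 inside group_by() and summarize(). df$var1 always refers to the entire column of the original data frame, so every group ends up with the overall mean and sd.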

library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean = 2, sd = 1)
var2 <- c(rep("A", 5), rep("B", 5), rep("C", 5))
df <- data.frame(var1, var2)

# Use bare column names so each summary is computed within its group
df_stat <- df %>%
  group_by(var2) %>%
  summarize(
    count = n(),
    mean = mean(var1, na.rm = TRUE),
    sd = sd(var1, na.rm = TRUE)
  )
head(df_stat)
# A tibble: 3 x 4
# var2   count  mean    sd
# <fct>  <int>  <dbl>  <dbl>
# 1 A      5    2.19   0.811
# 2 B      5    1.96   1.16 
# 3 C      5    2.31   0.639
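
As a cross-check on the per-group figures above, here is a minimal sketch that computes the same count, mean, and sd in base R, without dplyr. It assumes the same seed and data as in the question; the names by_group and overall are only illustrative. The second part reproduces what df$var1 inside summarize() was effectively computing: statistics over the whole column.

```r
set.seed(123)
var1 <- rnorm(15, mean = 2, sd = 1)
var2 <- rep(c("A", "B", "C"), each = 5)
df <- data.frame(var1, var2)

# Per-group statistics: tapply() splits var1 by var2 before applying each function
by_group <- data.frame(
  var2  = sort(unique(df$var2)),
  count = as.vector(tapply(df$var1, df$var2, length)),
  mean  = as.vector(tapply(df$var1, df$var2, mean, na.rm = TRUE)),
  sd    = as.vector(tapply(df$var1, df$var2, sd, na.rm = TRUE))
)
print(by_group)

# Whole-column statistics: what every group received when df$var1 was used
overall <- c(mean = mean(df$var1, na.rm = TRUE), sd = sd(df$var1, na.rm = TRUE))
print(overall)
```

If the per-group means here agree with the dplyr output and the overall values agree with the 2.15 and 0.845 from the question, that confirms the only difference between the two dplyr calls is whether the column is looked up inside the group or on the full data frame.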
answered Oct 04 '22 19:10 by dbo