
Using n() at the same time as calculating other summary statistics

Tags: r, dplyr, summary

I am having trouble preparing a summary table with dplyr from the data set below:

set.seed(1)
df <- data.frame(sample(c(2012, 2016), 10, replace = TRUE),
                 sample(c('Treat', 'Control'), 10, replace = TRUE),
                 runif(10, 0, 1),
                 runif(10, 0, 1),
                 runif(10, 0, 1))

colnames(df) <- c('Year', 'Group', 'V1', 'V2', 'V3')

I want to calculate the mean, median, standard deviation and count the number of observations by each combination of Year and Group.

I have successfully used this code to get mean, median and sd:

library(dplyr)

summary.table <- df %>%
    group_by(Year, Group) %>%
    summarise_all(funs(n(), sd, median, mean))

However, I do not know how to use n() properly inside funs(). As written, it gives me a separate count for each of V1, V2 and V3, which is redundant, since I only want the sample size once per group. I have tried adding

    mutate(N = n()) %>%

before and after the group_by() line, but it did not give me what I wanted.

Any help?


EDIT: I had not made my question clear enough. The problem is that the code gives me columns that I do not need, since the number of observations for V1 alone is sufficient for me.

Arthur Carvalho Brito asked Jul 11 '17 01:07




2 Answers

Add the N column before summarizing as an extra grouping column:

library(dplyr)
set.seed(1)

df <- data.frame(Year = rep(sample(c(2012, 2016), 10, replace = TRUE)),
                 Group = sample(c('Treat', 'Control'), 10, replace = TRUE),
                 V1 = runif(10, 0, 1),
                 V2 = runif(10, 0, 1),
                 V3 = runif(10, 0, 1))


df2 <- df %>% 
    group_by(Year, Group) %>% 
    group_by(N = n(), add = TRUE) %>% 
    summarise_all(funs(sd, median, mean))

df2
#> # A tibble: 4 x 12
#> # Groups:   Year, Group [?]
#>    Year   Group     N      V1_sd      V2_sd     V3_sd V1_median V2_median
#>   <dbl>  <fctr> <int>      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>
#> 1  2012 Control     2 0.05170954 0.29422635 0.1152669 0.3037848 0.6193239
#> 2  2012   Treat     2 0.51092899 0.08307494 0.1229560 0.5734239 0.5408230
#> 3  2016 Control     3 0.32043716 0.34402222 0.3822026 0.3823880 0.4935413
#> 4  2016   Treat     3 0.37759667 0.29566739 0.1233162 0.3861141 0.6684667
#> # ... with 4 more variables: V3_median <dbl>, V1_mean <dbl>,
#> #   V2_mean <dbl>, V3_mean <dbl>
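
The output above comes from an older dplyr release. In newer releases (1.0.0 and later, as far as I know) funs() and the add argument of group_by() are deprecated, so here is a minimal sketch of the same idea using summarise() with across(). The name summary_table is only illustrative, and the columns come out ordered by variable rather than by function:

```r
library(dplyr)
set.seed(1)

# Same structure as the question's data; values depend on the R version's RNG.
df <- data.frame(Year = sample(c(2012, 2016), 10, replace = TRUE),
                 Group = sample(c("Treat", "Control"), 10, replace = TRUE),
                 V1 = runif(10, 0, 1),
                 V2 = runif(10, 0, 1),
                 V3 = runif(10, 0, 1))

summary_table <- df %>%
    group_by(Year, Group) %>%
    summarise(
        # One count per group instead of one per variable.
        N = n(),
        # Naming the columns explicitly keeps N out of the across() selection.
        across(c(V1, V2, V3), list(sd = sd, median = median, mean = mean)),
        .groups = "drop"
    )

summary_table
```

Selecting V1, V2 and V3 by name is deliberate: a selection such as where(is.numeric) would also pick up the freshly created N column.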
alistaire answered Nov 15 '22 08:11



Are you getting the same error I am?

“Error in n(): function should not be called directly”

If so, there is a Stack Overflow question about it that might help: dplyr: "Error in n(): function should not be called directly"

The resolution seems to be to detach plyr, which appears to conflict with dplyr, and then reload the dplyr library.
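
A minimal sketch of that workaround, assuming the error comes from plyr having been attached after dplyr so that its functions mask dplyr's. The built-in mtcars data set and the name counts are used only for illustration:

```r
# Detach plyr only if it is currently attached, so this is safe to run either way.
if ("package:plyr" %in% search()) {
    detach("package:plyr")
}
library(dplyr)

# Namespacing the verbs explicitly also avoids any masking by plyr.
counts <- mtcars %>%
    dplyr::group_by(cyl) %>%
    dplyr::summarise(N = dplyr::n())

counts
```

If both packages are really needed in one session, loading plyr before dplyr is the usual advice, so that dplyr's versions of the shared function names take precedence.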

Billy Jackson answered Nov 15 '22 06:11
