Using n() at the same time as calculating other summary statistics

Tags:

r

dplyr

summary

I am having trouble preparing a summary table using dplyr based on the data set below:

library(dplyr)

set.seed(1)
df <- data.frame(rep(sample(c(2012, 2016), 10, replace = TRUE)),
                 sample(c('Treat', 'Control'), 10, replace = TRUE),
                 runif(10, 0, 1),
                 runif(10, 0, 1),
                 runif(10, 0, 1))

colnames(df) <- c('Year', 'Group', 'V1', 'V2', 'V3')

I want to calculate the mean, median, standard deviation and count the number of observations by each combination of Year and Group.

I have successfully used this code to get mean, median and sd:

summary.table = df %>% 
    group_by(Year, Group) %>%
    summarise_all(funs(n(), sd, median, mean))

However, I do not know how to include the n() function properly inside funs(): it gave me a separate count for each of V1, V2 and V3. This is redundant, since I only want the sample size once per group. I have tried introducing

    mutate(N = n()) %>%

before and after the group_by() line, but it did not give me what I wanted.

Any help?

EDIT: I had not made my question clear enough. The problem is that the code gives me columns that I do not need, since the number of observations for V1 alone is sufficient for me.


asked Jul 11 '17 01:07

Arthur Carvalho Brito

2 Answers

Add the N column before summarizing as an extra grouping column:

library(dplyr)
set.seed(1)

df <- data.frame(Year = rep(sample(c(2012, 2016), 10, replace = TRUE)),
                 Group = sample(c('Treat', 'Control'), 10, replace = TRUE),
                 V1 = runif(10, 0, 1),
                 V2 = runif(10, 0, 1),
                 V3 = runif(10, 0, 1))


df2 <- df %>% 
    group_by(Year, Group) %>% 
    group_by(N = n(), add = TRUE) %>% 
    summarise_all(funs(sd, median, mean))

df2
#> # A tibble: 4 x 12
#> # Groups:   Year, Group [?]
#>    Year   Group     N      V1_sd      V2_sd     V3_sd V1_median V2_median
#>   <dbl>  <fctr> <int>      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>
#> 1  2012 Control     2 0.05170954 0.29422635 0.1152669 0.3037848 0.6193239
#> 2  2012   Treat     2 0.51092899 0.08307494 0.1229560 0.5734239 0.5408230
#> 3  2016 Control     3 0.32043716 0.34402222 0.3822026 0.3823880 0.4935413
#> 4  2016   Treat     3 0.37759667 0.29566739 0.1233162 0.3861141 0.6684667
#> # ... with 4 more variables: V3_median <dbl>, V1_mean <dbl>,
#> #   V2_mean <dbl>, V3_mean <dbl>
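
On dplyr 1.0.0 or later, funs() is deprecated and the add argument of group_by() has been renamed to .add. A minimal sketch of the same idea for those versions, assuming the value columns are V1 to V3 as in the data above, might call n() once inside summarise() and use across() for the other statistics. The name df3 is only illustrative, and no output is shown because the sampled values depend on the R version's random number generator:

```r
library(dplyr)

set.seed(1)
df <- data.frame(Year = sample(c(2012, 2016), 10, replace = TRUE),
                 Group = sample(c('Treat', 'Control'), 10, replace = TRUE),
                 V1 = runif(10), V2 = runif(10), V3 = runif(10))

# n() is evaluated once per group, so N appears a single time.
# across() is restricted to V1:V3, so the new N column is not summarised itself.
df3 <- df %>%
    group_by(Year, Group) %>%
    summarise(N = n(),
              across(V1:V3, list(sd = sd, median = median, mean = mean)),
              .groups = "drop")

df3
```

Restricting across() to V1:V3 rather than everything() is a deliberate choice here: columns created earlier in the same summarise() call can otherwise be picked up by the selection.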

answered Nov 15 '22 08:11

alistaire

Are you getting the same error I am:

“Error in n(): function should not be called directly”

If so, there is a Stack Overflow question on that error that might help: dplyr: "Error in n(): function should not be called directly"

The resolution seems to be detaching plyr, which appears to conflict with dplyr, and then reloading the dplyr library.
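
A rough sketch of that workaround, assuming the conflict comes from plyr being attached after dplyr so that plyr's summarise() masks dplyr's, could look like the following. The toy data frame and the names toy and counts are made up for illustration:

```r
# Detach plyr only if it is currently attached, then (re)load dplyr.
if ("package:plyr" %in% search()) detach("package:plyr")
library(dplyr)

# Hypothetical toy data, not from the question.
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# Namespacing the calls makes sure dplyr's summarise() and n() are used
# even if another package masks them later.
counts <- toy %>%
    group_by(g) %>%
    dplyr::summarise(N = dplyr::n())

counts
```

If both packages are needed in the same session, writing dplyr::summarise() explicitly is often enough without detaching anything.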

answered Nov 15 '22 06:11

Billy Jackson
