Summary statistics by two or more factor variables?

Tags: r, summary

This is best illustrated with an example:

str(mtcars)
mtcars$gear <- factor(mtcars$gear, labels=c("three","four","five"))
mtcars$cyl <- factor(mtcars$cyl, labels=c("four","six","eight"))
# In mtcars, am is coded 0 = automatic, 1 = manual, so "auto" must come first
mtcars$am <- factor(mtcars$am, labels=c("auto","manual"))
str(mtcars)
tapply(mtcars$mpg, mtcars$gear, sum)

That gives me the summed mpg per gear. But say I wanted a 3x3 table with gear across the top and cyl down the side, with the 9 cells holding the bivariate sums. How would I get that 'smartly'?

I could do:

tapply(mtcars$mpg[mtcars$cyl=="four"], mtcars$gear[mtcars$cyl=="four"], sum)
tapply(mtcars$mpg[mtcars$cyl=="six"], mtcars$gear[mtcars$cyl=="six"], sum)
tapply(mtcars$mpg[mtcars$cyl=="eight"], mtcars$gear[mtcars$cyl=="eight"], sum)

This seems cumbersome.

Then how would I bring a 3rd variable into the mix?

This is somewhat in the space I'm thinking about: "Summary statistics using ddply".

Update: this gets me there, but it's not pretty.

aggregate(mpg ~ am + cyl + gear, mtcars, sum)

Cheers

asked Apr 19 '12 01:04

nzcoops

2 Answers

How about this, still using tapply()? It's more versatile than you knew!

with(mtcars, tapply(mpg, list(cyl, gear), sum))
#       three  four five
# four   21.5 215.4 56.4
# six    39.5  79.0 19.7
# eight 180.6    NA 30.8

Or, if you'd like the printed output to be a bit more interpretable:

with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
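As a quick sanity check (a sketch, not part of the original answer), the two-factor tapply() call can be compared against the per-cylinder subsetting approach from the question. It assumes a fresh copy of the built-in mtcars, so the row and column names are the raw codes 4/6/8 and 3/4/5 rather than the factor labels. The names tab, four and manual are only illustrative.

```r
# Start from the untouched built-in data set (numeric cyl and gear codes)
data(mtcars)

# One call: rows are cyl, columns are gear, cells are summed mpg
tab <- with(mtcars, tapply(mpg, list(cyl = cyl, gear = gear), sum))

# The question's approach, for the 4-cylinder cars only
four <- mtcars[mtcars$cyl == 4, ]
manual <- tapply(four$mpg, four$gear, sum)

# The "4" row of the table matches the manual result
stopifnot(isTRUE(all.equal(unname(tab["4", ]), as.vector(manual))))

# Combinations with no cars come back as NA, so use na.rm when totalling
stopifnot(is.na(tab["8", "4"]))
sum(tab, na.rm = TRUE)
```

Any other summary function, such as mean or length, can be swapped in for sum in the same way.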

If you want to use more than two cross-classifying variables, the idea's exactly the same. The results will then be returned in a 3-or-more-dimensional array:

A <- with(mtcars, tapply(mpg, list(cyl, gear, carb), sum))

dim(A)
# [1] 3 3 6
lapply(1:6, function(i) A[,,i]) # To convert results to a list of matrices

# But eventually, the curse of dimensionality will begin to kick in...
table(is.na(A))
# FALSE  TRUE 
#    12    42
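If the sparse array becomes unwieldy, one option (a sketch, not from the original answer) is to flatten it back into one row per observed combination with as.table() and as.data.frame(). The names A3 and long are illustrative, and the snippet assumes the unmodified built-in mtcars.

```r
# Start from the untouched built-in data set
data(mtcars)

# Naming the list elements carries the names through to the array's dimnames
A3 <- with(mtcars, tapply(mpg, list(cyl = cyl, gear = gear, carb = carb), sum))

# One row per cell of the array; the summed values go in a column called "mpg"
long <- as.data.frame(as.table(A3), responseName = "mpg")

# Drop the empty combinations that tapply() filled with NA
long <- long[!is.na(long$mpg), ]
head(long)
```

This keeps only the combinations that actually occur in the data, which is the same shape that aggregate() returns.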

answered Sep 16 '22 12:09

Josh O'Brien

I think the answers already on this question are fantastic options, but I wanted to share an additional option based on the dplyr package (this came up for me because I'm teaching a class right now where we use dplyr for data manipulation, so I wanted to avoid introducing students to specialized base R functions like tapply or aggregate).

You can group on as many variables as you want using the group_by function and then summarize information from these groups with summarize. I think this code is more readable to an R newcomer than the formula-based interface of aggregate, and it yields the same group sums:

library(dplyr)
mtcars %>%
  group_by(am, cyl, gear) %>%
  summarize(mpg=sum(mpg))
#       am   cyl  gear   mpg
#    (dbl) (dbl) (dbl) (dbl)
# 1      0     4     3  21.5
# 2      0     4     4  47.2
# 3      0     6     3  39.5
# 4      0     6     4  37.0
# 5      0     8     3 180.6
# 6      1     4     4 168.2
# 7      1     4     5  56.4
# 8      1     6     4  42.0
# 9      1     6     5  19.7
# 10     1     8     5  30.8

With two variables, you can summarize with one variable on the rows and the other on the columns by adding a call to the spread function from the tidyr package:

library(dplyr)
library(tidyr)
mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mpg=sum(mpg)) %>%
  spread(gear, mpg)
#     cyl     3     4     5
#   (dbl) (dbl) (dbl) (dbl)
# 1     4  21.5 215.4  56.4
# 2     6  39.5  79.0  19.7
# 3     8 180.6    NA  30.8
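In tidyr 1.0.0 and later, spread() is marked as superseded in favour of pivot_wider(). Here is a sketch of the same table with the newer function, assuming dplyr 1.0.0 or later for the .groups argument and the unmodified built-in mtcars (the name wide is illustrative):

```r
library(dplyr)
library(tidyr)

# Start from the untouched built-in data set
data(mtcars)

wide <- mtcars %>%
  group_by(cyl, gear) %>%
  # .groups = "drop" returns an ungrouped result and avoids the grouping message
  summarize(mpg = sum(mpg), .groups = "drop") %>%
  # One column per gear value, filled with the summed mpg
  pivot_wider(names_from = gear, values_from = mpg)

wide
```

Combinations that do not occur, such as 8 cylinders with 4 gears, are left as NA unless a values_fill argument is supplied.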

answered Sep 16 '22 12:09

josliber
