This is best illustrated with an example
str(mtcars)
mtcars$gear <- factor(mtcars$gear, labels=c("three","four","five"))
mtcars$cyl <- factor(mtcars$cyl, labels=c("four","six","eight"))
mtcars$am <- factor(mtcars$am, labels=c("manual","auto")
str(mtcars)
tapply(mtcars$mpg, mtcars$gear, sum)
That gives me the summed mpg per gear. But say I wanted a 3x3 table with gear across the top and cyl down the side, and 9 cells with the bivariate sums in, how would I get that 'smartly'.
I could go.
tapply(mtcars$mpg[mtcars$cyl=="four"], mtcars$gear[mtcars$cyl=="four"], sum)
tapply(mtcars$mpg[mtcars$cyl=="six"], mtcars$gear[mtcars$cyl=="six"], sum)
tapply(mtcars$mpg[mtcars$cyl=="eight"], mtcars$gear[mtcars$cyl=="eight"], sum)
This seems cumbersome.
Then how would I bring a 3rd variable in the mix?
This is somewhat in the space I'm thinking about. Summary statistics using ddply
update This gets me there, but it's not pretty.
aggregate(mpg ~ am+cyl+gear, mtcars,sum)
Cheers
Variables may be classified into two main categories: categorical and numeric. Each category is then classified in two subcategories: nominal or ordinal for categorical variables, discrete or continuous for numeric variables.
In descriptive statistics, summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible. Statisticians commonly try to describe the observations in. a measure of location, or central tendency, such as the arithmetic mean.
One way to summarize a categorical variable is to compute the frequencies of the categories. For further summarization, the frequency of the modal category (most frequent category) is often reported.
The basic statistics available for categorical variables are counts and percentages. Number of cases in each cell of the table or number of responses for multiple response sets. If weighting is in effect, this value is the weighted count.
Descriptive statistics for one categorical variable Descriptive statistics used to analyse data for a single categorical variable include frequencies, percentages, fractions and/or relative frequencies (which are simply frequencies divided by the sample size) obtained from the variable's frequency distribution table.
There are three types of categorical variables: binary, nominal, and ordinal variables.
How about this, still using tapply()
? It's more versatile than you knew!
with(mtcars, tapply(mpg, list(cyl, gear), sum))
# three four five
# four 21.5 215.4 56.4
# six 39.5 79.0 19.7
# eight 180.6 NA 30.8
Or, if you'd like the printed output to be a bit more interpretable:
with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
If you want to use more than two cross-classifying variables, the idea's exactly the same. The results will then be returned in a 3-or-more-dimensional array:
A <- with(mtcars, tapply(mpg, list(cyl, gear, carb), sum))
dim(A)
# [1] 3 3 6
lapply(1:6, function(i) A[,,i]) # To convert results to a list of matrices
# But eventually, the curse of dimensionality will begin to kick in...
table(is.na(A))
# FALSE TRUE
# 12 42
I think the answers already on this question are fantastic options, but I wanted to share an additional option based on the dplyr
package (this came up for me because I'm teaching a class right now where we use dplyr
for data manipulation, so I wanted to avoid introducing students to specialized base R functions like tapply
or aggregate
).
You can group on as many variables as you want using the group_by
function and then summarize information from these groups with summarize
. I think this code is more readable to an R newcomer than the formula-based interface of aggregate
, yielding identical results:
library(dplyr)
mtcars %>%
group_by(am, cyl, gear) %>%
summarize(mpg=sum(mpg))
# am cyl gear mpg
# (dbl) (dbl) (dbl) (dbl)
# 1 0 4 3 21.5
# 2 0 4 4 47.2
# 3 0 6 3 39.5
# 4 0 6 4 37.0
# 5 0 8 3 180.6
# 6 1 4 4 168.2
# 7 1 4 5 56.4
# 8 1 6 4 42.0
# 9 1 6 5 19.7
# 10 1 8 5 30.8
With two variables, you can summarize with one variable on the rows and the other on the columns by adding a call to the spread
function from the tidyr
package:
library(dplyr)
library(tidyr)
mtcars %>%
group_by(cyl, gear) %>%
summarize(mpg=sum(mpg)) %>%
spread(gear, mpg)
# cyl 3 4 5
# (dbl) (dbl) (dbl) (dbl)
# 1 4 21.5 215.4 56.4
# 2 6 39.5 79.0 19.7
# 3 8 180.6 NA 30.8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With