Summary statistics by two or more factor variables?

Tags: r, summary

This is best illustrated with an example:

str(mtcars)
mtcars$gear <- factor(mtcars$gear, labels=c("three","four","five"))
mtcars$cyl <- factor(mtcars$cyl, labels=c("four","six","eight"))
mtcars$am <- factor(mtcars$am, labels=c("auto","manual"))  # in mtcars, am: 0 = automatic, 1 = manual
str(mtcars)
tapply(mtcars$mpg, mtcars$gear, sum)

That gives me the summed mpg per gear. But say I wanted a 3x3 table with gear across the top and cyl down the side, with the 9 cells holding the bivariate sums. How would I get that 'smartly'?

I could do this:

tapply(mtcars$mpg[mtcars$cyl=="four"], mtcars$gear[mtcars$cyl=="four"], sum)
tapply(mtcars$mpg[mtcars$cyl=="six"], mtcars$gear[mtcars$cyl=="six"], sum)
tapply(mtcars$mpg[mtcars$cyl=="eight"], mtcars$gear[mtcars$cyl=="eight"], sum)

This seems cumbersome.

And how would I then bring a third variable into the mix?

This related question is somewhat in the space I'm thinking about: "Summary statistics using ddply".

Update: this gets me there, but it's not pretty.

aggregate(mpg ~ am + cyl + gear, data = mtcars, FUN = sum)

Cheers

asked Apr 19 '12 01:04 by nzcoops



2 Answers

How about this, still using tapply()? It's more versatile than you knew!

with(mtcars, tapply(mpg, list(cyl, gear), sum))
#       three  four five
# four   21.5 215.4 56.4
# six    39.5  79.0 19.7
# eight 180.6    NA 30.8

Or, if you'd like the printed output to be a bit more interpretable:

with(mtcars, tapply(mpg, list("Cylinder#"=cyl, "Gear#"=gear), sum))
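For comparison, base R's xtabs() builds the same cross-classified sums from a formula. This is only a sketch: it assumes an unmodified copy of the dataset (datasets::mtcars, without the factor relabelling from the question), so the row and column names are the raw numeric codes, and the name cars is just illustrative. One difference from tapply() is worth noting: combinations with no observations are reported as 0 rather than NA.

```r
# Sketch: start from an unmodified copy of the built-in dataset (assumption).
cars <- datasets::mtcars

# Sum mpg within every cyl x gear combination; rows = cyl, columns = gear.
tab <- xtabs(mpg ~ cyl + gear, data = cars)
print(tab)

# A third factor gives a 3-way table; ftable() prints it as one flat table.
tab3 <- xtabs(mpg ~ am + cyl + gear, data = cars)
print(ftable(tab3))
```

Whether 0 or NA is the better marker for an empty cell depends on the statistic: for sums a 0 is often acceptable, but for means it would be misleading, so tapply() may be the safer choice there.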

If you want to use more than two cross-classifying variables, the idea's exactly the same. The results will then be returned in a 3-or-more-dimensional array:

A <- with(mtcars, tapply(mpg, list(cyl, gear, carb), sum))

dim(A)
# [1] 3 3 6
lapply(1:6, function(i) A[,,i]) # To convert results to a list of matrices

# But eventually, the curse of dimensionality will begin to kick in...
table(is.na(A))
# FALSE  TRUE 
#    12    42 
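When most cells of a higher-dimensional array are empty, a long table of only the non-missing cells can be easier to read. A minimal sketch, assuming an unmodified datasets::mtcars and naming the list elements so that the output columns get labels (the names cars and long are only illustrative):

```r
cars <- datasets::mtcars

# Naming the list elements names the dimensions of the resulting array.
A <- with(cars, tapply(mpg, list(cyl = cyl, gear = gear, carb = carb), sum))

# as.table() + as.data.frame() gives one row per cell of the array ...
long <- as.data.frame(as.table(A), responseName = "mpg")

# ... and dropping the NA cells keeps only combinations that occur in the data.
long <- long[!is.na(long$mpg), ]
print(long)
```

The result has one row per observed cyl/gear/carb combination, which is essentially the shape that aggregate() returns.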
answered Sep 16 '22 12:09 by Josh O'Brien


I think the answers already on this question are fantastic options, but I wanted to share an additional one based on the dplyr package. (This came up because I'm teaching a class where we use dplyr for data manipulation, so I wanted to avoid introducing students to specialized base R functions like tapply or aggregate.)

You can group on as many variables as you want using the group_by function and then summarize each group with summarize. I think this code is more readable to an R newcomer than the formula-based interface of aggregate, and it yields the same results:

library(dplyr)
mtcars %>%
  group_by(am, cyl, gear) %>%
  summarize(mpg=sum(mpg))
#       am   cyl  gear   mpg
#    (dbl) (dbl) (dbl) (dbl)
# 1      0     4     3  21.5
# 2      0     4     4  47.2
# 3      0     6     3  39.5
# 4      0     6     4  37.0
# 5      0     8     3 180.6
# 6      1     4     4 168.2
# 7      1     4     5  56.4
# 8      1     6     4  42.0
# 9      1     6     5  19.7
# 10     1     8     5  30.8

With two variables, you can summarize with one variable on the rows and the other on the columns by adding a call to the spread function from the tidyr package:

library(dplyr)
library(tidyr)
mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mpg=sum(mpg)) %>%
  spread(gear, mpg)
#     cyl     3     4     5
#   (dbl) (dbl) (dbl) (dbl)
# 1     4  21.5 215.4  56.4
# 2     6  39.5  79.0  19.7
# 3     8 180.6    NA  30.8
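In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). Here is a sketch of the same reshaping with the newer function, assuming reasonably recent dplyr (1.0.0 or later, for the .groups argument) and tidyr are installed. The variable name wide is only illustrative, and the data comes from an unmodified datasets::mtcars.

```r
library(dplyr)
library(tidyr)

wide <- datasets::mtcars %>%
  group_by(cyl, gear) %>%
  # .groups = "drop" returns an ungrouped result
  summarize(mpg = sum(mpg), .groups = "drop") %>%
  # one column per gear value, filled with the summed mpg
  pivot_wider(names_from = gear, values_from = mpg)

print(wide)
```

As with spread(), combinations that never occur in the data (here 8 cylinders with 4 gears) show up as NA.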
answered Sep 16 '22 12:09 by josliber