I am trying to create one table that summarizes several categorical variables (using frequencies and proportions) by another variable. I would like to do this using the dplyr package.
These previous Stack Overflow discussions have partially what I am looking for: Relative frequencies / proportions with dplyr and Calculate relative frequency for a certain group.
Using the mtcars dataset, this is what the output would look like if I just wanted to look at the proportion of gear
by am
category:
mtcars %>%
group_by(am, gear) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
# am gear n freq
# 1 0 3 15 0.7894737
# 2 0 4 4 0.2105263
# 3 1 4 8 0.6153846
# 4 1 5 5 0.3846154
However, I actually want to look at not only the gears
by am
, but also carb
by am
and cyl
by am
, separately, in the same table. If I amend the code to:
mtcars %>%
group_by (am, gear, carb, cyl) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
I get the frequencies for each combination of am
, gear
, carb
, and cyl
. Which is not what I want. Is there any way to do this with dplyr?
EDIT
Also, it would be an added bonus if anyone knew of a way to produce the table I want, but with the categories of am
as the columns (as in a classic 2x2 table format). Here is an example of what i'm referring to. It is from one of my previous publications. I want to produce this table in R, so that I can output it directly to a word document using RMarkdown:
One way to solve this, is to turn your data to a long(er) format. You can then use the same code to calculate the outcomes you want, with one extra group_by:
library(reshape2)
library(dplyr)
m_mtcars <- melt(mtcars,measure.vars=c("gear","carb","cyl"))
res <- m_mtcars %>%
group_by(am, variable, value) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
Building on this, the desired output can be obtained using more reshaping and some string formatting
#make an 'export' variable
res$export <- with(res, sprintf("%i (%.1f%%)", n, freq*100))
#reshape again
output <- dcast(variable+value~am, value.var="export", data=res, fill="missing") #use drop=F to prevent silent missings
#'silent missings'
output$variable <- as.character(output$variable)
#make 'empty lines'
empties <- data.frame(variable=unique(output$variable), stringsAsFactors=F)
empties[,colnames(output)[-1]] <- ""
#bind them together
output2 <- rbind(empties,output)
output2 <- output2[order(output2$variable,output2$value),]
#optional: 'remove' variable if value present
output2$variable[output2$value!=""] <- ""
This results in:
variable value 0 1
2 carb
7 1 3 (15.8%) 4 (30.8%)
8 2 6 (31.6%) 4 (30.8%)
9 3 3 (15.8%) missing
10 4 7 (36.8%) 3 (23.1%)
11 6 missing 1 (7.7%)
12 8 missing 1 (7.7%)
3 cyl
13 4 3 (15.8%) 8 (61.5%)
14 6 4 (21.1%) 3 (23.1%)
15 8 12 (63.2%) 2 (15.4%)
1 gear
4 3 15 (78.9%) missing
5 4 4 (21.1%) 8 (61.5%)
6 5 missing 5 (38.5%)
With tidyr/dplyr combination, here is how you would do it:
library(tidyr)
library(dplyr)
mtcars %>%
gather(variable, value, gear, carb, cyl) %>%
group_by(am, variable, value) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With