Preserve order of input variables and factor levels in summary table, using dplyr and tidyr

Tags:

r

dplyr

tidyr

I love how easy dplyr and tidyr have made it to create a single summary table with multiple predictor and outcome variables. One thing that got me stumped was the final step of preserving/defining the order of the predictor variables, and their factor levels, in the output table.

I've come up with a solution of sorts (below), which involves using mutate to manually make a factor variable that combines both the predictor and the predictor value (e.g. "gender_Female"), with levels in the desired output order. But my solution is a bit long-winded if there are many variables, and I wonder if there is a better way?

library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

set.seed(1234)

dat <- data.frame(
  gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  # Statement below creates variable for ordering output
  mutate(
    pred_ord = factor(interaction(predictor, addNA(pred_value), sep = "_"),
                      levels = c(paste("gender", levels(addNA(dat$gender)), sep = "_"),
                                 paste("ethnicity", levels(addNA(dat$ethnicity)), sep = "_")))
  ) %>%
  group_by(pred_ord, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key = outcome, value = n) %>%
  separate(pred_ord, c("Predictor", "Pred_value"))

Source: local data frame [9 x 4]

  Predictor Pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1    gender     Female       25       27
2    gender       Male       11       10
3    gender    Unknown       12       15
4 ethnicity      Maori       10        9
5 ethnicity    Pacific        7        7
6 ethnicity      Asian        6       12
7 ethnicity      Other       10        9
8 ethnicity   European        5        4
9 ethnicity    Unknown       10       11
Warning message:
attributes are not identical across measure variables; they will be dropped

The table above is correct in that neither the predictors nor their values are re-sorted alphabetically.

EDIT

As requested, this is what is produced if the default ordering (alphabetical) is used. It makes sense in that when the factors are combined they are converted to a character variable and all attributes are dropped.

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n)

Source: local data frame [9 x 4]

  predictor pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1 ethnicity      Asian        6       12
2 ethnicity   European        5        4
3 ethnicity      Maori       10        9
4 ethnicity      Other       10        9
5 ethnicity    Pacific        7        7
6 ethnicity    Unknown       10       11
7    gender     Female       25       27
8    gender       Male       11       10
9    gender    Unknown       12       15
Warning message:
attributes are not identical across measure variables; they will be dropped
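
The alphabetical order follows from the coercion to character. A minimal base-R illustration (separate from the pipeline above; the vector eth and the two result names are made up for the example) of how the same values sort as a factor versus as plain strings:

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")

# Hypothetical small vector, only for illustration
eth <- factor(c("Unknown", "European", "Maori", "Asian"), levels = levels_eth)

# A factor sorts by the order of its levels ...
sorted_factor <- as.character(sort(eth))

# ... but once coerced to character, sorting is alphabetical
sorted_char <- sort(as.character(eth))

print(sorted_factor)
print(sorted_char)
```

Once the level information is gone, any grouping or sorting step can only fall back on string order.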

asked Aug 26 '16 01:08

JWilliman

1 Answer

If you want your data to be factors arranged as such, you'll need to convert them back to factors, as gather coerces to character (which it warns you about). You can use gather's factor_key parameter to take care of predictor, but you'll need to assemble levels for pred_value as it now combines two factors from the original. Simplifying a bit:

library(tidyr)
library(dplyr)

dat %>% 
    gather(key = predictor, value = pred_value, gender, ethnicity, factor_key = TRUE) %>%
    group_by(predictor, pred_value) %>% 
    summarise_all(sum) %>%
    ungroup() %>% 
    mutate(pred_value = factor(pred_value, levels = unique(c(levels_eth, levels_gnd), 
                                                           fromLast = TRUE))) %>% 
    arrange(predictor, pred_value)

## # A tibble: 9 × 4
##   predictor pred_value outcome1 outcome2
##      <fctr>     <fctr>    <int>    <int>
## 1    gender     Female       25       27
## 2    gender       Male       11       10
## 3    gender    Unknown       12       15
## 4 ethnicity      Maori       10        9
## 5 ethnicity    Pacific        7        7
## 6 ethnicity      Asian        6       12
## 7 ethnicity      Other       10        9
## 8 ethnicity   European        5        4
## 9 ethnicity    Unknown       10       11
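
If there are many predictors, one way to avoid writing the levels out by hand is to summarise each predictor separately, so that its own factor levels set the row order, and then stack the blocks. The following is only a base-R sketch of that idea, not the dplyr/tidyr pipeline above; predictors, outcomes and summary_tab are illustrative names, and dat is rebuilt as in the question. The counts may differ from the tables shown here, because the default behaviour of sample() changed in R 3.6.0.

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

set.seed(1234)
dat <- data.frame(
  gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

# The order of this vector defines the order of the blocks in the table
predictors <- c("gender", "ethnicity")
outcomes   <- c("outcome1", "outcome2")

summary_tab <- do.call(rbind, lapply(predictors, function(p) {
  # tapply() on a factor returns one cell per level, in level order
  counts <- sapply(dat[outcomes], function(o) tapply(o, dat[[p]], sum))
  data.frame(predictor = p, pred_value = rownames(counts), counts,
             row.names = NULL)
}))

print(summary_tab)
```

Because each block is computed from the original factor, no combined level vector is needed, and a level shared between predictors (such as "Unknown") cannot end up in the wrong place.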

Note that you'll need to use unique with fromLast = TRUE to arrange the duplicate "Unknown" values into a single occurrence in the right place; union will put it earlier.
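
The difference is easy to check in base R with the two level vectors from the question (keep_last and keep_first are just illustrative names):

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

# Keeps the LAST occurrence of each duplicate, so "Unknown" ends up at the end
keep_last <- unique(c(levels_eth, levels_gnd), fromLast = TRUE)

# Keeps the FIRST occurrence, so "Unknown" stays ahead of "Female" and "Male"
keep_first <- union(levels_eth, levels_gnd)

print(keep_last)
# "Maori" "Pacific" "Asian" "Other" "European" "Female" "Male" "Unknown"
print(keep_first)
# "Maori" "Pacific" "Asian" "Other" "European" "Unknown" "Female" "Male"
```

This works here because the shared level "Unknown" is meant to come last within both factors; a shared level that sits in different positions in different factors could not be ordered correctly by a single combined level vector.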

answered Oct 06 '22 00:10

alistaire
