Dummy code categorical / ordinal variables in the tidyverse

Tags: r, purrr, dummy-variable, tidyverse

Let's say I have a tibble.

library(tidyverse)
# Example data: an id column plus two factor columns to dummy code
tib <- as_tibble(list(record = c(1:10),
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))
tib

# A tibble: 10 x 3
   record gender like_product
    <int> <fctr>       <fctr>
 1      1      F            2
 2      2      M            1
 3      3      M            2
 4      4      F            3
 5      5      F            4
 6      6      M            2
 7      7      F            4
 8      8      M            4
 9      9      F            4
10     10      M            5

I would like to dummy code my data with 1s and 0s so that it looks more or less like this.

# A tibble: 10 x 8
   record gender_M gender_F like_product_1 like_product_2 like_product_3 like_product_4 like_product_5
    <int>    <dbl>    <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
 1      1        0        1              0              1              0              0              0
 2      2        1        0              1              0              0              0              0
 3      3        1        0              0              1              0              0              0
 4      4        0        1              0              0              1              0              0
 5      5        0        1              0              0              0              1              0
 6      6        1        0              0              1              0              0              0
 7      7        0        1              0              0              0              1              0
 8      8        1        0              0              0              0              1              0
 9      9        0        1              0              0              0              1              0
10     10        1        0              0              0              0              0              1

My workflow requires that I specify a range of variables to dummy code (e.g. gender:like_product), but I don't want to identify EVERY variable by hand (there could be hundreds of variables). Likewise, I don't want to identify every level/unique value of every variable to dummy code. I'm ultimately looking for a tidyverse solution.

I know of several ways of doing this, but none of them fits perfectly within the tidyverse. I know I could use mutate...

tib %>%
     mutate(gender_M = ifelse(gender == "M", 1, 0), 
            gender_F = ifelse(gender == "F", 1, 0), 
            like_product_1 = ifelse(like_product == 1, 1, 0), 
            like_product_2 = ifelse(like_product == 2, 1, 0), 
            like_product_3 = ifelse(like_product == 3, 1, 0), 
            like_product_4 = ifelse(like_product == 4, 1, 0), 
            like_product_5 = ifelse(like_product == 5, 1, 0)) %>%
     select(-gender, -like_product)

But this breaks my workflow rule, because I have to specify every dummy-coded output column by hand.

I've done this in the past with model.matrix, from the stats package.

model.matrix(~ gender + like_product, tib)

Easy and straightforward, but I want a solution in the tidyverse. EDIT: The reason is that with model.matrix I still have to specify every variable; being able to use select helpers to specify something like gender:like_product would be much preferred.
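For reference, model.matrix can be fed a range of columns chosen with dplyr's select(), so that no variable or level has to be named by hand. The following is only a sketch under assumptions: the data is a fixed copy of the example printed above (instead of a random sample), and the names to_code, full_contrasts, mm and coded are illustrative. It also assumes every selected column is a factor.

```r
library(dplyr)
library(tibble)

# Fixed copy of the example data (assumption: same values as the printed tibble)
tib <- tibble(
  record = 1:10,
  gender = factor(c("F", "M", "M", "F", "F", "M", "F", "M", "F", "M")),
  like_product = factor(c(2, 1, 2, 3, 4, 2, 4, 4, 4, 5), levels = 1:5)
)

# Pick the columns to dummy code with a range selection
to_code <- select(tib, gender:like_product)

# contrasts = FALSE asks for one indicator column per level (no reference level dropped)
full_contrasts <- lapply(to_code, contrasts, contrasts = FALSE)

# Build the design matrix for all selected columns at once
mm <- model.matrix(~ ., data = to_code, contrasts.arg = full_contrasts)

# Drop the intercept column and put the untouched id column back in front
dummies <- as_tibble(mm[, colnames(mm) != "(Intercept)", drop = FALSE])
coded <- bind_cols(select(tib, record), dummies)
coded
```

The resulting columns follow model.matrix naming (variable name pasted directly to the level, without an underscore), so a rename step would still be needed to get names such as gender_M.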

I think the solution is in purrr.

library(purrr)
dummy_code <- function(x) {
     lvls <- levels(x)
     sapply(lvls, function(y) as.integer(x == y)) %>% as_tibble()
} 

tib %>%
     map_at(c("gender", "like_product"), dummy_code)

$record
 [1]  1  2  3  4  5  6  7  8  9 10

$gender
# A tibble: 10 x 2
       F     M
   <int> <int>
 1     1     0
 2     0     1
 3     0     1
 4     1     0
 5     1     0
 6     0     1
 7     1     0
 8     0     1
 9     1     0
10     0     1

$like_product
# A tibble: 10 x 5
     `1`   `2`   `3`   `4`   `5`
   <int> <int> <int> <int> <int>
 1     0     1     0     0     0
 2     1     0     0     0     0
 3     0     1     0     0     0
 4     0     0     1     0     0
 5     0     0     0     1     0
 6     0     1     0     0     0
 7     0     0     0     1     0
 8     0     0     0     1     0
 9     0     0     0     1     0
10     0     0     0     0     1

This attempt produces a list in which each dummy-coded variable is a tibble and the untouched variable record is left as a plain vector. I've been unsuccessful at combining them all back into a single tibble. Additionally, I still have to specify every column, and overall it seems clunky.
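One possible way to finish the purrr attempt is to name the dummy columns inside the helper and then bind everything back together with dplyr::bind_cols(). This is a hedged sketch rather than a definitive solution: the data is a fixed copy of the example above, dummy_code is rewritten to take the column name as a second argument, and to_code, dummies, rest and coded are illustrative names.

```r
library(dplyr)
library(purrr)
library(tibble)

# Fixed copy of the example data (assumption: same values as the printed tibble)
tib <- tibble(
  record = 1:10,
  gender = factor(c("F", "M", "M", "F", "F", "M", "F", "M", "F", "M")),
  like_product = factor(c(2, 1, 2, 3, 4, 2, 4, 4, 4, 5), levels = 1:5)
)

# One 0/1 integer column per factor level, named "<variable>_<level>"
dummy_code <- function(x, name) {
  lvls <- levels(x)
  cols <- map(lvls, function(lvl) as.integer(x == lvl))
  names(cols) <- paste(name, lvls, sep = "_")
  as_tibble(cols)
}

# Range selection: no need to list the columns or their levels
to_code <- select(tib, gender:like_product)

# imap() passes each column together with its name; the result is a list of tibbles
dummies <- imap(to_code, dummy_code)

# Keep the remaining columns and bind the dummy tibbles next to them
rest <- select(tib, -(gender:like_product))
coded <- bind_cols(rest, unname(dummies))
coded
```

Because the levels come from levels(x), a level that never occurs in the data still gets an all-zero column, which may or may not be what is wanted.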

Any better ideas? Thanks!!

asked Mar 22 '18 16:03

Jacob Nelson

1 Answer

An alternative to model.matrix is the recipes package. It is still a work in progress and is not yet included in the tidyverse, although at some point it might be.

I will leave it up to you to read up on recipes, but in step_dummy() you can use the selectors from the tidyselect package (installed with recipes), the same ones you can use in dplyr, such as starts_with(). I created a little example to show the steps.

Example code below.

Whether this is handier I will leave up to you, as this has already been pointed out in the comments. The function bake() uses model.matrix to create the dummies. The difference is mostly in the column names and, of course, in the internal checks that are done in the underlying code of the separate steps.

library(recipes)
library(tibble)

tib <- as_tibble(list(record = c(1:10),
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)), 
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))

dum <- tib %>% 
  recipe(~ .) %>% 
  step_dummy(gender, like_product) %>% 
  prep(training = tib) %>% 
  bake(new_data = tib)

dum

# A tibble: 10 x 6
   record gender_M like_product_X2 like_product_X3 like_product_X4 like_product_X5
    <int>    <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
 1      1       1.              1.              0.              0.              0.
 2      2       1.              1.              0.              0.              0.
 3      3       1.              1.              0.              0.              0.
 4      4       0.              0.              1.              0.              0.
 5      5       0.              0.              0.              0.              0.
 6      6       0.              1.              0.              0.              0.
 7      7       0.              1.              0.              0.              0.
 8      8       0.              0.              0.              1.              0.
 9      9       0.              0.              0.              0.              1.
10     10       1.              0.              0.              0.              0.
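The example above names both columns and drops the first level of each factor. A variation that may be closer to the question is sketched below, assuming a recipes version that provides the all_nominal() selector and the one_hot argument of step_dummy(). The data is a fixed copy of the question's example rather than a random sample, and dum_all is an illustrative name.

```r
library(recipes)
library(dplyr)
library(tibble)

# Fixed copy of the question's example data (assumption: same values as printed there)
tib <- tibble(
  record = 1:10,
  gender = factor(c("F", "M", "M", "F", "F", "M", "F", "M", "F", "M")),
  like_product = factor(c(2, 1, 2, 3, 4, 2, 4, 4, 4, 5), levels = 1:5)
)

dum_all <- recipe(~ ., data = tib) %>%
  # all_nominal() selects every factor/character column, so none is named by hand;
  # one_hot = TRUE keeps an indicator for every level instead of dropping the first
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep(training = tib) %>%
  bake(new_data = tib)

dum_all
```

The integer column record is not nominal, so it passes through unchanged. The dummy columns keep the recipes naming scheme shown above (for example like_product_X2).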

answered Oct 21 '22 19:10

phiver
