Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create dummies based on two columns in R

Assume I have a dataframe: Gender can take F as female or M as male Race can take A as Asian, W as White, B as Black and H as Hispanic

| id | Gender | Race |
| --- | ----- | ---- |
| 1   | F    | W |
| 2   | F    | B |
| 3   | M    | A |
| 4   | F    | B |
| 5   | M    | W |
| 6   | M    | B |
| 7   | F    | H |

And I want to have a set of columns as dummies base on Gender and Race, the dataframe should be like

| id | Gender | Race | F_W | F_B | F_A | F_H | M_W | M_B | M_A | M_H |
| --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1   | F    | W   |  1  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
| 2   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 3   | M    | A   |  0  |  0  |  0  |  0  |  0  |  0  |  1  |  0  |
| 4   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 5   | M    | W   |  0  |  0  |  0  |  0  |  1  |  0  |  0  |  0  |
| 6   | M    | B   |  0  |  0  |  0  |  0  |  0  |  1  |  0  |  0  |
| 7   | F    | H   |  0  |  0  |  0  |  1  |  0  |  0  |  0  |  0  |

My actual data contains of much more categories than this example so I do appreciate if you can make it in a more neat way. The language is R. Thank you for your help.

like image 397
xxx Avatar asked Jul 17 '21 13:07

xxx


People also ask

How do I convert categorical data to dummy variables in R?

To convert category variables to dummy variables in tidyverse, use the spread() method. To do so, use the spread() function with three arguments: key, which is the column to convert into categorical values, in this case, “Reporting Airline”; value, which is the value you want to set the key to (in this case “dummy”);

How do you create a dummy Dataframe in R?

To create an empty Data Frame in R, call data. frame() function, and pas no arguments to it. The function returns an empty Data Frame with zero rows and zero columns.

How do I put two columns together in R?

How do I concatenate two columns in R? To concatenate two columns you can use the <code>paste()</code> function. For example, if you want to combine the two columns A and B in the dataframe df you can use the following code: <code>df['AB'] <- paste(df$A, df$B)</code>.


5 Answers

Apart from the column names, you can get this with the model.matrix function and a formula expressing just the interaction terms, and subtracting an intercept:

> dm = cbind(d,model.matrix(~Gender:Race-1, data=d))
> dm
   id Gender Race GenderF:RaceA GenderM:RaceA GenderF:RaceB GenderM:RaceB
1   1      F    H             0             0             0             0
2   2      M    H             0             0             0             0
3   3      M    W             0             0             0             0
4   4      F    H             0             0             0             0
5   5      M    H             0             0             0             0
[etc]

If you care about the exact names its easy enough to sort them out with a bit of string processing.

> names(dm)[-(1:3)] = sub("Gender","",sub("Race","",sub(":","_",names(dm)[-(1:3)])))
> dm
   id Gender Race F_A M_A F_B M_B F_H M_H F_W M_W
1   1      F    H   0   0   0   0   1   0   0   0
2   2      M    H   0   0   0   0   0   1   0   0
3   3      M    W   0   0   0   0   0   0   0   1
4   4      F    H   0   0   0   0   1   0   0   0
5   5      M    H   0   0   0   0   0   1   0   0
6   6      F    H   0   0   0   0   1   0   0   0
7   7      F    H   0   0   0   0   1   0   0   0
8   8      M    A   0   1   0   0   0   0   0   0
9   9      M    W   0   0   0   0   0   0   0   1
10 10      F    B   0   0   1   0   0   0   0   0

If you care about the column order....

like image 55
Spacedman Avatar answered Nov 15 '22 05:11

Spacedman


A base R option with table

 cbind(df1, as.data.frame.matrix(table(transform(df1, 
    GenderRace = paste(Gender, Race, sep = "_"))[c("id", "GenderRace")])))
  id Gender Race F_B F_H F_W M_A M_B M_W
1  1      F    W   0   0   1   0   0   0
2  2      F    B   1   0   0   0   0   0
3  3      M    A   0   0   0   1   0   0
4  4      F    B   1   0   0   0   0   0
5  5      M    W   0   0   0   0   0   1
6  6      M    B   0   0   0   0   1   0
7  7      F    H   0   1   0   0   0   0

data

df1 <- structure(list(id = 1:7, Gender = c("F", "F", "M", "F", "M", 
"M", "F"), Race = c("W", "B", "A", "B", "W", "B", "H")), 
class = "data.frame", row.names = c(NA, 
-7L))
like image 38
akrun Avatar answered Nov 15 '22 05:11

akrun


Another base R option with xtabs

cbind(
    df,
    as.data.frame.matrix(
        xtabs(
            ~ id + q,
            transform(
                df,
                q = paste0(Gender, "_", Race)
            )
        )
    )
)

gives

  id Gender Race F_B F_H F_W M_A M_B M_W
1  1      F    W   0   0   1   0   0   0
2  2      F    B   1   0   0   0   0   0
3  3      M    A   0   0   0   1   0   0
4  4      F    B   1   0   0   0   0   0
5  5      M    W   0   0   0   0   0   1
6  6      M    B   0   0   0   0   1   0
7  7      F    H   0   1   0   0   0   0
like image 43
ThomasIsCoding Avatar answered Nov 15 '22 06:11

ThomasIsCoding


I think you can use the following solution. It has actually 2 variables fewer than your desired output where the output will be zero nonetheless. Since pivot_wider will spread all the combinations that can be found in the data set.

library(dplyr)
library(tidyr)

df %>%
  mutate(grp = 1) %>%
  pivot_wider(names_from = c(Gender, Race), values_from = grp, 
              values_fill = 0, names_glue = "{Gender}_{Race}") %>%
  right_join(df, by = "id") %>%
  relocate(id, Gender, Race)

# A tibble: 7 x 9
     id Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
  <int> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 F      W         1     0     0     0     0     0
2     2 F      B         0     1     0     0     0     0
3     3 M      A         0     0     1     0     0     0
4     4 F      B         0     1     0     0     0     0
5     5 M      W         0     0     0     1     0     0
6     6 M      B         0     0     0     0     1     0
7     7 F      H         0     0     0     0     0     1
like image 30
Anoushiravan R Avatar answered Nov 15 '22 07:11

Anoushiravan R


In addtion to the tidyverse solution from Anoushiravan R. Here is another option with unite, pivot_wider, across and case_when

library(tidyverse)
  df %>% 
    unite(comb, Gender:Race, remove = FALSE) %>% 
    pivot_wider(
      names_from = comb,
      values_from = comb
    ) %>% 
    mutate(across(c(F_W, F_B, M_A, M_W, M_B, F_H), 
                  ~ case_when(is.na(.) ~ 0, 
                              TRUE ~ 1)))

Output:

  id    Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
  <chr> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1     F      W         1     0     0     0     0     0
2 2     F      B         0     1     0     0     0     0
3 3     M      A         0     0     1     0     0     0
4 4     F      B         0     1     0     0     0     0
5 5     M      W         0     0     0     1     0     0
6 6     M      B         0     0     0     0     1     0
7 7     F      H         0     0     0     0     0     1
like image 27
TarJae Avatar answered Nov 15 '22 05:11

TarJae