Assume I have a dataframe: Gender can take F as female or M as male Race can take A as Asian, W as White, B as Black and H as Hispanic
| id | Gender | Race |
| --- | ----- | ---- |
| 1 | F | W |
| 2 | F | B |
| 3 | M | A |
| 4 | F | B |
| 5 | M | W |
| 6 | M | B |
| 7 | F | H |
And I want to have a set of columns as dummies base on Gender and Race, the dataframe should be like
| id | Gender | Race | F_W | F_B | F_A | F_H | M_W | M_B | M_A | M_H |
| --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | F | W | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | F | B | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | M | A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | F | B | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | M | W | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 6 | M | B | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 7 | F | H | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
My actual data contains of much more categories than this example so I do appreciate if you can make it in a more neat way. The language is R. Thank you for your help.
To convert category variables to dummy variables in tidyverse, use the spread() method. To do so, use the spread() function with three arguments: key, which is the column to convert into categorical values, in this case, “Reporting Airline”; value, which is the value you want to set the key to (in this case “dummy”);
To create an empty Data Frame in R, call data. frame() function, and pas no arguments to it. The function returns an empty Data Frame with zero rows and zero columns.
How do I concatenate two columns in R? To concatenate two columns you can use the <code>paste()</code> function. For example, if you want to combine the two columns A and B in the dataframe df you can use the following code: <code>df['AB'] <- paste(df$A, df$B)</code>.
Apart from the column names, you can get this with the model.matrix
function and a formula expressing just the interaction terms, and subtracting an intercept:
> dm = cbind(d,model.matrix(~Gender:Race-1, data=d))
> dm
id Gender Race GenderF:RaceA GenderM:RaceA GenderF:RaceB GenderM:RaceB
1 1 F H 0 0 0 0
2 2 M H 0 0 0 0
3 3 M W 0 0 0 0
4 4 F H 0 0 0 0
5 5 M H 0 0 0 0
[etc]
If you care about the exact names its easy enough to sort them out with a bit of string processing.
> names(dm)[-(1:3)] = sub("Gender","",sub("Race","",sub(":","_",names(dm)[-(1:3)])))
> dm
id Gender Race F_A M_A F_B M_B F_H M_H F_W M_W
1 1 F H 0 0 0 0 1 0 0 0
2 2 M H 0 0 0 0 0 1 0 0
3 3 M W 0 0 0 0 0 0 0 1
4 4 F H 0 0 0 0 1 0 0 0
5 5 M H 0 0 0 0 0 1 0 0
6 6 F H 0 0 0 0 1 0 0 0
7 7 F H 0 0 0 0 1 0 0 0
8 8 M A 0 1 0 0 0 0 0 0
9 9 M W 0 0 0 0 0 0 0 1
10 10 F B 0 0 1 0 0 0 0 0
If you care about the column order....
A base R
option with table
cbind(df1, as.data.frame.matrix(table(transform(df1,
GenderRace = paste(Gender, Race, sep = "_"))[c("id", "GenderRace")])))
id Gender Race F_B F_H F_W M_A M_B M_W
1 1 F W 0 0 1 0 0 0
2 2 F B 1 0 0 0 0 0
3 3 M A 0 0 0 1 0 0
4 4 F B 1 0 0 0 0 0
5 5 M W 0 0 0 0 0 1
6 6 M B 0 0 0 0 1 0
7 7 F H 0 1 0 0 0 0
df1 <- structure(list(id = 1:7, Gender = c("F", "F", "M", "F", "M",
"M", "F"), Race = c("W", "B", "A", "B", "W", "B", "H")),
class = "data.frame", row.names = c(NA,
-7L))
Another base R option with xtabs
cbind(
df,
as.data.frame.matrix(
xtabs(
~ id + q,
transform(
df,
q = paste0(Gender, "_", Race)
)
)
)
)
gives
id Gender Race F_B F_H F_W M_A M_B M_W
1 1 F W 0 0 1 0 0 0
2 2 F B 1 0 0 0 0 0
3 3 M A 0 0 0 1 0 0
4 4 F B 1 0 0 0 0 0
5 5 M W 0 0 0 0 0 1
6 6 M B 0 0 0 0 1 0
7 7 F H 0 1 0 0 0 0
I think you can use the following solution. It has actually 2 variables fewer than your desired output where the output will be zero nonetheless. Since pivot_wider
will spread all the combinations that can be found in the data set.
library(dplyr)
library(tidyr)
df %>%
mutate(grp = 1) %>%
pivot_wider(names_from = c(Gender, Race), values_from = grp,
values_fill = 0, names_glue = "{Gender}_{Race}") %>%
right_join(df, by = "id") %>%
relocate(id, Gender, Race)
# A tibble: 7 x 9
id Gender Race F_W F_B M_A M_W M_B F_H
<int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 F W 1 0 0 0 0 0
2 2 F B 0 1 0 0 0 0
3 3 M A 0 0 1 0 0 0
4 4 F B 0 1 0 0 0 0
5 5 M W 0 0 0 1 0 0
6 6 M B 0 0 0 0 1 0
7 7 F H 0 0 0 0 0 1
In addtion to the tidyverse solution from Anoushiravan R.
Here is another option with unite
, pivot_wider
, across
and case_when
library(tidyverse)
df %>%
unite(comb, Gender:Race, remove = FALSE) %>%
pivot_wider(
names_from = comb,
values_from = comb
) %>%
mutate(across(c(F_W, F_B, M_A, M_W, M_B, F_H),
~ case_when(is.na(.) ~ 0,
TRUE ~ 1)))
Output:
id Gender Race F_W F_B M_A M_W M_B F_H
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 F W 1 0 0 0 0 0
2 2 F B 0 1 0 0 0 0
3 3 M A 0 0 1 0 0 0
4 4 F B 0 1 0 0 0 0
5 5 M W 0 0 0 1 0 0
6 6 M B 0 0 0 0 1 0
7 7 F H 0 0 0 0 0 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With