My dataset looks exactly like this (only with much more observations and attributes):
I want each attribute to have a binary variable with the value 1 if PersonX has it and otherwise 0, but each person must also include the attributes of others. It should look like this, where of course if Person1 has also the same attribute as Person2, the variable shouldn't be generated again :
ID Class_Label A469 T593 K022K A835 Z935 U83F W5326
Person1 TRUE 1 1 1 0 0 0 0
Person2 FALSE 0 1 0 1 1 0 0
Person3 FALSE 0 0 1 0 0 1 1
As you can see Person1 and Person3 have the attribute: K022K in common and Person1 and Person2 T593. Is there any way how to solve this?
library(tidyverse)
df <- tibble(
id = paste0("Person", 1:3),
class_label = c(TRUE, FALSE, FALSE),
attribute = c("A469/T593/K022K", "A835/Z935/T593", "U835F/W5326/K022K")
)
df
#> # A tibble: 3 x 3
#> id class_label attribute
#> <chr> <lgl> <chr>
#> 1 Person1 TRUE A469/T593/K022K
#> 2 Person2 FALSE A835/Z935/T593
#> 3 Person3 FALSE U835F/W5326/K022K
df %>%
separate_rows(attribute, sep = "/") %>%
mutate(i = 1) %>%
spread(attribute, i, fill = 0)
#> # A tibble: 3 x 9
#> id class_label A469 A835 K022K T593 U835F W5326 Z935
#> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Person1 TRUE 1 0 1 1 0 0 0
#> 2 Person2 FALSE 0 1 0 1 0 0 1
#> 3 Person3 FALSE 0 0 1 0 1 1 0
Taking note that what you are trying to do is frequently referred to as one-hot encoding
or binary encoding
. In addition, you may want to note that df %>% separate_rows(attribute, sep = "/")
your data is in a tidy format which could have other benefits.
UPDATE:
To extend to more columns, you might want to first determine which attributes you will be encoding. So something like, select(df, contains("attribute"))
or select(df, 3:4)
.
df <- tibble(
id = paste0("Person", 1:3),
class_label = c(TRUE, FALSE, FALSE),
attribute = c("A469/T593/K022K", "A835/Z935/T593", "U835F/W5326/K022K"),
attribute2 = c("one/two/three", "four/five/six", "one/five/six")
)
df
#> # A tibble: 3 x 4
#> id class_label attribute attribute2
#> <chr> <lgl> <chr> <chr>
#> 1 Person1 TRUE A469/T593/K022K one/two/three
#> 2 Person2 FALSE A835/Z935/T593 four/five/six
#> 3 Person3 FALSE U835F/W5326/K022K one/five/six
one_hot <- function(data, att) {
quo_att <- enquo(att)
data %>%
select(id, class_label, !! quo_att) %>%
separate_rows(!! quo_att, sep = "/") %>%
mutate(i = 1) %>%
spread(!! quo_att, i, fill = 0) %>%
select(-id, -class_label)
}
attributes_to_map <- select(df, contains("attribute")) %>% names
attributes_to_map
#> [1] "attribute" "attribute2"
attributes_to_map %>%
map_dfc(~ one_hot(df, .)) %>%
bind_cols(select(df, id, class_label)) %>%
select(id, class_label, everything())
#> # A tibble: 3 x 15
#> id class_label A469 A835 K022K T593 U835F W5326 Z935 five four
#> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Pers~ TRUE 1 0 1 1 0 0 0 0 0
#> 2 Pers~ FALSE 0 1 0 1 0 0 1 1 1
#> 3 Pers~ FALSE 0 0 1 0 1 1 0 1 0
#> # ... with 4 more variables: one <dbl>, six <dbl>, three <dbl>, two <dbl>
But at this point, you may also want to consider the recipes package or do searches on one-hot encoding for several variables.
Ok, using your example as minimal.xlsx
:
install.packages('readxl') # if you don't have this already
library(readxl)
example <- read_excel('./minimal.xlsx') # assuming file is in working directory
example$Attribute <- as.character(example$Attribute) # convert to character
attrs <- strsplit(example$Attribute, '/') # split by /
attrs <- unlist(attrs) # flatten the list
attrs <- unique(attrs) # extract uniques
for (attr in attrs) {
attr_row <- grepl(attr, example$Attribute) # boolean of True/false
attr_row <- attr_row * 1 # convert to 0, 1
example[attr] <- attr_row
}
I tried to explain it in the comments, but in essence:
The result is here:
You can delete the original Attribute column afterward too, but that should get you what you want, is a generalized solution, and doesn't require outside libraries.
Edit: The other answer is shorter and definitely use it for quick wrangling on this, personally I often like to use base R if I can for small tasks like this, especially for scripts I want to share with others.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With