Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to import Excel table in R in order to get binary variables?

My dataset looks exactly like this (only with much more observations and attributes): Dataset.xlsx

I want each attribute to have a binary variable with the value 1 if PersonX has it and otherwise 0, but each person must also include the attributes of others. It should look like this, where of course if Person1 has also the same attribute as Person2, the variable shouldn't be generated again :

ID       Class_Label    A469 T593 K022K A835 Z935 U83F W5326
Person1  TRUE           1    1    1     0    0    0    0
Person2  FALSE          0    1    0     1    1    0    0
Person3  FALSE          0    0    1     0    0    1    1

As you can see Person1 and Person3 have the attribute: K022K in common and Person1 and Person2 T593. Is there any way how to solve this?

like image 739
Textime Avatar asked Mar 05 '23 07:03

Textime


2 Answers

library(tidyverse)

df <- tibble(
  id = paste0("Person", 1:3),
  class_label = c(TRUE, FALSE, FALSE),
  attribute = c("A469/T593/K022K", "A835/Z935/T593", "U835F/W5326/K022K")
)
df
#> # A tibble: 3 x 3
#>   id      class_label attribute        
#>   <chr>   <lgl>       <chr>            
#> 1 Person1 TRUE        A469/T593/K022K  
#> 2 Person2 FALSE       A835/Z935/T593   
#> 3 Person3 FALSE       U835F/W5326/K022K

df %>%
  separate_rows(attribute, sep = "/") %>%
  mutate(i = 1) %>%
  spread(attribute, i, fill = 0)
#> # A tibble: 3 x 9
#>   id      class_label  A469  A835 K022K  T593 U835F W5326  Z935
#>   <chr>   <lgl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Person1 TRUE            1     0     1     1     0     0     0
#> 2 Person2 FALSE           0     1     0     1     0     0     1
#> 3 Person3 FALSE           0     0     1     0     1     1     0

Taking note that what you are trying to do is frequently referred to as one-hot encoding or binary encoding. In addition, you may want to note that df %>% separate_rows(attribute, sep = "/") your data is in a tidy format which could have other benefits.

UPDATE: To extend to more columns, you might want to first determine which attributes you will be encoding. So something like, select(df, contains("attribute")) or select(df, 3:4).

df <- tibble(
  id = paste0("Person", 1:3),
  class_label = c(TRUE, FALSE, FALSE),
  attribute = c("A469/T593/K022K", "A835/Z935/T593", "U835F/W5326/K022K"),
  attribute2 = c("one/two/three", "four/five/six", "one/five/six")
)
df
#> # A tibble: 3 x 4
#>   id      class_label attribute         attribute2   
#>   <chr>   <lgl>       <chr>             <chr>        
#> 1 Person1 TRUE        A469/T593/K022K   one/two/three
#> 2 Person2 FALSE       A835/Z935/T593    four/five/six
#> 3 Person3 FALSE       U835F/W5326/K022K one/five/six

one_hot <- function(data, att) {
  quo_att <- enquo(att)
  data %>%
    select(id, class_label, !! quo_att) %>% 
    separate_rows(!! quo_att, sep = "/") %>%
    mutate(i = 1) %>%
    spread(!! quo_att, i, fill = 0) %>%
    select(-id, -class_label)
}


attributes_to_map <- select(df, contains("attribute")) %>% names
attributes_to_map
#> [1] "attribute"  "attribute2"

attributes_to_map %>%
  map_dfc(~ one_hot(df, .)) %>%
  bind_cols(select(df, id, class_label)) %>%
  select(id, class_label, everything())
#> # A tibble: 3 x 15
#>   id    class_label  A469  A835 K022K  T593 U835F W5326  Z935  five  four
#>   <chr> <lgl>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Pers~ TRUE            1     0     1     1     0     0     0     0     0
#> 2 Pers~ FALSE           0     1     0     1     0     0     1     1     1
#> 3 Pers~ FALSE           0     0     1     0     1     1     0     1     0
#> # ... with 4 more variables: one <dbl>, six <dbl>, three <dbl>, two <dbl>

But at this point, you may also want to consider the recipes package or do searches on one-hot encoding for several variables.

like image 105
JasonAizkalns Avatar answered Mar 07 '23 01:03

JasonAizkalns


Ok, using your example as minimal.xlsx:

Minimal example CSV

install.packages('readxl')  # if you don't have this already
library(readxl)

example <- read_excel('./minimal.xlsx')  # assuming file is in working directory
example$Attribute <- as.character(example$Attribute)  # convert to character

attrs <- strsplit(example$Attribute, '/')  # split by /
attrs <- unlist(attrs)  # flatten the list
attrs <- unique(attrs)  # extract uniques

for (attr in attrs) {
  attr_row <- grepl(attr, example$Attribute)  # boolean of True/false
  attr_row <- attr_row * 1  # convert to 0, 1
  example[attr] <- attr_row
}

I tried to explain it in the comments, but in essence:

  • Turn the Attributes into characters and split them on your denoting character
  • Combine them into a "set" vector of unique Attributes
  • Loop through them, and generate each row
  • Append each row back into the DataFrame

The result is here:

Final output

You can delete the original Attribute column afterward too, but that should get you what you want, is a generalized solution, and doesn't require outside libraries.

Edit: The other answer is shorter and definitely use it for quick wrangling on this, personally I often like to use base R if I can for small tasks like this, especially for scripts I want to share with others.

like image 21
James Wang Avatar answered Mar 07 '23 00:03

James Wang