I have some dataset similar to this:
df <- data.frame(n = 1:1000000, x = sample(LETTERS, 1000000, replace = TRUE))
I'm looking for guidance on how to split variable x into multiple indicator variables, one per category, each taking the value 0 or 1.
In the end it would look like this:
n x A B C D E F G H . . .
1 D 0 0 0 1 0 0 0 0 . . .
2 B 0 1 0 0 0 0 0 0 . . .
3 F 0 0 0 0 0 1 0 0 . . .
In my dataset, there are many more codes in variable x, so adding each new variable manually would be too time-consuming.
I was thinking about sorting the codes in x, assigning each one a unique number, and then writing a loop that creates a new variable for each code. But I feel like I'm overcomplicating things.
A fast and easy way is to use fastDummies::dummy_cols:
fastDummies::dummy_cols(df, "x")
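If installing an extra package is not an option, the same kind of 0/1 columns can be sketched in base R with model.matrix. This is only an illustration on a small stand-in data frame; the name df_small and the choice to fix the factor levels to all 26 letters are assumptions, not part of the original answer.

```r
# Small stand-in for the question's data (same columns, fewer rows).
set.seed(1)
df_small <- data.frame(n = 1:10, x = sample(LETTERS, 10, replace = TRUE))

# Fixing the levels to all 26 letters gives one column per letter,
# even for letters that do not occur in the sample.
df_small$x <- factor(df_small$x, levels = LETTERS)

# "~ x - 1" drops the intercept, so every level gets its own 0/1 column.
dummies <- model.matrix(~ x - 1, data = df_small)
colnames(dummies) <- levels(df_small$x)  # "xA", "xB", ... become "A", "B", ...

result <- cbind(df_small, dummies)
head(result)
```

Each row of result has exactly one 1 among the letter columns, in the column matching its value of x.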
An alternative with tidyverse functions:
library(tidyverse)
df %>% 
  left_join(., df %>% mutate(value = 1) %>% 
              pivot_wider(names_from = x, values_from = value, values_fill = 0) %>% 
              relocate(n, sort(colnames(.)[-1])))
output
> dummy <- fastDummies::dummy_cols(df, "x")
> colnames(dummy)[-c(1,2)] <- LETTERS
> dummy
    n x A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1   1 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2   2 Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
3   3 E 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4   4 H 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5   5 T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
6   6 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
7   7 R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
8   8 F 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9   9 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
10 10 S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
Benchmark
Since there are many solutions and the question involves a large dataset, a benchmark might help. According to this benchmark, the nnet solution is the fastest.
set.seed(1)
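df <- data.frame(n = 1:1000000, x = sample(LETTERS, 1000000, replace = TRUE))
library(microbenchmark)
library(ggplot2)  # provides the autoplot() generic used below
# fModel.matrix(), fContrasts(), etc. wrap the individual solutions;
# their definitions are not shown here.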
bm <- microbenchmark(
  fModel.matrix(),
  fContrasts(),
  fnnet(),
  fdata.table(),
  fFastDummies(),
  fDplyr(),
  times = 10L,
  setup = gc(FALSE)
)
autoplot(bm)
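The benchmarked functions are not defined above, so here is a rough sketch of what the nnet variant might look like, using nnet::class.ind. The function name fnnet_sketch and the small data frame df_small are hypothetical; the original fnnet() may well differ.

```r
library(nnet)  # a recommended package, usually bundled with R

set.seed(1)
df_small <- data.frame(n = 1:10, x = sample(LETTERS, 10, replace = TRUE))

# Hypothetical reconstruction; the benchmarked fnnet() is not shown in the post.
fnnet_sketch <- function(d) {
  # class.ind() returns a 0/1 matrix with one column per distinct value of x
  cbind(d, class.ind(d$x))
}

res <- fnnet_sketch(df_small)
head(res)
```

Note that class.ind only creates columns for values that actually occur in x, so on a small sample the result can have fewer than 26 letter columns.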
