R: create dummy variables based on a categorical variable of lists [duplicate]

Tags: list, r, dummy-variable, tidyverse

I have a data frame with a categorical variable that holds lists of strings of varying length. (The varying length matters: without it, this question would be a duplicate of this or this.) For example:

df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")
df

  x       y
1 1       A
2 2    A, B
3 3       C
4 4 B, D, C
5 5       E

The desired result has one dummy variable for each unique string that appears anywhere in df$y, i.e.:

data.frame(x = 1:5, A = c(1,1,0,0,0), B = c(0,1,0,1,0), C = c(0,0,1,1,0), D = c(0,0,0,1,0), E = c(0,0,0,0,1))

  x A B C D E
1 1 1 0 0 0 0
2 2 1 1 0 0 0
3 3 0 0 1 0 0
4 4 0 1 1 1 0
5 5 0 0 0 0 1

This naive approach works:

uniqueStrings <- unique(unlist(df$y))
n <- ncol(df)
# add one 0/1 column per unique string, named after that string
for (i in seq_along(uniqueStrings)) {
  df[, n + i] <- sapply(df$y, function(x) as.numeric(uniqueStrings[i] %in% x))
  colnames(df)[n + i] <- uniqueStrings[i]
}

However, it is very ugly, lazy, and slow with big data frames.

Any suggestions? Something fancy from the tidyverse?

UPDATE: I got three different approaches below. I tested them with system.time on my laptop (Windows 7, 32GB RAM) on a real dataset of 1M rows, each row containing a list of 1 to 4 strings (out of ~350 unique string values), 200MB on disk overall. The expected result is therefore a data frame with dimensions 1M x 350. The tidyverse (@Sotos) and base (@joel.wilson) approaches took so long that I had to restart R. The qdapTools (@akrun) approach, however, worked fantastically:

> system.time(res1 <- mtabulate(varsLists))
   user  system elapsed 
  47.05   10.27  116.82

So this is the approach I'll mark as accepted.
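
The @akrun answer is not reproduced on this page, so here is a minimal sketch of how mtabulate from qdapTools could be applied to the example df. It assumes qdapTools is installed and that varsLists above is a list column like df$y. Note that mtabulate returns counts, so a string repeated within one row would give a value above 1 rather than a strict 0/1 dummy.

```r
# Sketch only: needs the qdapTools package; the block is skipped if it is missing.
df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

if (requireNamespace("qdapTools", quietly = TRUE)) {
  # mtabulate() tabulates each list element against all unique strings
  counts <- qdapTools::mtabulate(df$y)
  # keep the id column and append one count column per string
  res <- cbind(df["x"], counts)
  print(res)
}
```

The columns come out sorted by string value, so the order may differ from the order of first appearance in df$y.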


asked Jan 16 '17 08:01

Giora Simchoni

2 Answers

Another idea:

library(dplyr)
library(tidyr)

df %>% 
 unnest(y) %>% 
 mutate(new = 1) %>% 
 spread(y, new, fill = 0) 

#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1
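
In tidyr 1.0.0 and later, spread is superseded by pivot_wider, so the same pipeline might be written as below. This is a sketch that assumes a recent dplyr and tidyr; the distinct() step is an addition here, guarding against the same string appearing twice in one row.

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = 1:5)
df$y <- list("A", c("A", "B"), "C", c("B", "D", "C"), "E")

res <- df %>%
  unnest(y) %>%          # one row per (x, string) pair
  distinct(x, y) %>%     # drop repeated strings within the same x
  mutate(new = 1) %>%
  pivot_wider(names_from = y, values_from = new,
              values_fill = list(new = 0))  # absent strings become 0

print(res)
```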

To handle the cases you mentioned in the comments (repeated values of x), we can use dcast from reshape2, which is more flexible than spread:

df2 <- df %>% 
        unnest(y) %>% 
        group_by(x) %>% 
        filter(!duplicated(y)) %>% 
        ungroup()

reshape2::dcast(df2, x ~ y, value.var = 'y', length)

#  x A B C D E
#1 1 1 0 0 0 0
#2 2 1 1 0 0 0
#3 3 0 0 1 0 0
#4 4 0 1 1 1 0
#5 5 0 0 0 0 1

#or with df$x <- c(1, 1, 2, 2, 3)

#  x A B C D E
#1 1 1 1 0 0 0
#2 2 0 1 1 1 0
#3 3 0 0 0 0 1

#or with df$x <- rep(1,5)

#  x A B C D E
#1 1 1 1 1 1 1


answered Sep 19 '22 03:09

Sotos

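This involves no external packages:

# thanks to Sotos for suggesting to use `unique(unlist(df$y))` instead of `LETTERS[1:5]`
# note: grepl() pattern-matches against the character form of each list element,
# so it can over-match when one string is contained in another (e.g. "A" and "AB")
sapply(unique(unlist(df$y)), function(j) as.numeric(grepl(j, df$y)))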
#     A B C D E
#[1,] 1 0 0 0 0
#[2,] 1 1 0 0 0
#[3,] 0 0 1 0 0
#[4,] 0 1 1 1 0
#[5,] 0 0 0 0 1
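
If the strings can contain one another, an exact-match variant may be safer. The sketch below swaps grepl for %in%, which is a change from the answer's technique, and uses a made-up example where "A" is a substring of "AB". The names lev, loose and exact are illustrative.

```r
df <- data.frame(x = 1:3)
df$y <- list("AB", c("A", "C"), "C")

lev <- unique(unlist(df$y))

# Pattern matching: "A" is also found inside "AB", so row 1 is flagged
loose <- grepl("A", df$y)

# Exact membership: test each list element with %in% instead
exact <- sapply(lev, function(j) {
  as.numeric(vapply(df$y, function(v) j %in% v, logical(1)))
})
print(exact)
```

Here exact is a matrix with one column per string, like the output above, and can be bound to the other columns with cbind.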

answered Sep 21 '22 03:09

joel.wilson
