Dummify character column and find unique values [duplicate]

I have a data frame with the following structure:

test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'))

Now I want to create a data frame from this that contains one named column for each of the unique values in the test data frame. The values are separated by '; ': each value ends with the ';' character, and the space before it is not part of the value. Then, for each row, I want to fill the dummy columns with a 1 if the row contains that value and a 0 otherwise, as given below:

data.frame(a = c(1,1), ff = c(1,0), cc = c(1,1), rr = c(1,1), e = c(0,1))

  a ff cc rr e
1 1  1  1  1 0
2 1  0  1  1 1

I tried creating a data frame using for loops and the unique values in the column, but it is getting too messy. I already have a vector containing the unique values of the column. The problem is how to create the ones and zeros. I tried mutate_all() with grep(), but this did not work.

Michael asked Feb 22 '17 09:02


3 Answers

I'd use the splitstackshape package together with mtabulate() from the qdapTools package to get this as a one-liner:

library(splitstackshape)
library(qdapTools)

mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
#   a cc ff rr e
#V1 1  1  1  1 0
#V2 1  1  0  1 1

It can also be done entirely with splitstackshape, as @A5C1D2H2I1M1N2O1R2T1 mentions in the comments:

cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)
Sotos answered Nov 03 '22 08:11


Here's a possible data.table implementation. First we split each row into columns, melt them into a single column, and then spread it wide while counting the occurrences of each value per row:

library(data.table)
test2 <- setDT(test)[, tstrsplit(col, "; |;")]
dcast(melt(test2, measure = names(test2)), rowid(variable) ~ value, length)
#    variable a cc e ff rr
# 1:        1 1  1 0  1  1
# 2:        2 1  1 1  0  1
David Arenburg answered Nov 03 '22 08:11


Here's a base R approach:

x   <- strsplit(as.character(test$col), ";\\s?") # split the strings
lvl <- unique(unlist(x))                         # get unique elements
x   <- lapply(x, factor, levels = lvl)           # convert to factor
t(sapply(x, table))                              # count elements and transpose
#     a ff cc rr e
#[1,] 1  1  1  1 0
#[2,] 1  0  1  1 1
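
The question mentions already having a vector of the unique values, so the following is a minimal base R sketch (not part of the original answer) of how the ones and zeros could be filled in with a membership test via %in% instead of table(). The names parts and dummies are made up for illustration, and stringsAsFactors = FALSE is an assumption so that col is a character vector on any R version.

```r
# Same input as in the question; keep col as character rather than factor
test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'),
                   stringsAsFactors = FALSE)

# Split each row on ';' followed by an optional space
parts <- strsplit(test$col, ";\\s?")

# Unique values in order of first appearance
lvl <- unique(unlist(parts))

# For each row, flag which unique values occur in it (1) or not (0),
# then stack the per-row vectors into a matrix with one row per input row
dummies <- do.call(rbind, lapply(parts, function(p) as.integer(lvl %in% p)))
colnames(dummies) <- lvl

# Convert the matrix to a data frame with one named column per value
dummies <- as.data.frame(dummies)
dummies
```

Unlike table(), which counts occurrences, the membership test yields a 1 even when a value is repeated within a row, which may or may not be what is wanted.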
talat answered Nov 03 '22 09:11