
Casting unique features in column to variable names and dummy coding original features into variables in R

I'm having trouble working out how to dummy code the following dataset.

Example data (let's say the data frame is called mydata):

ID |     NAMES      |
-- | -------------- |
1  | 4444, 333, 456 |
2  | 333            |
3  | 456, 765       |
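
For reproducibility, the table above could be built like this. This is a minimal sketch that assumes NAMES is stored as a character column rather than a factor:

```r
# Assumed construction of the example data; NAMES is taken to be character.
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)
```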

I'd like to cast only the unique values in NAMES as column variables and code whether each row has that value or not, i.e. 1 or 0.

Desired Output:

ID |     NAMES      | 4444 | 333 | 456 | 765 |
-- | -------------- |------|-----|-----|-----|
1  | 4444, 333, 456 |   1  |  1  |  1  |   0 |
2  | 333            |   0  |  1  |  0  |   0 |
3  | 456, 765       |   0  |  0  |  1  |   1 |

What I've done so far is create a vector of the unique values in NAMES:

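library(stringr)   # str_split(), str_trim()
library(reshape2)  # dcast()

# Split each NAMES entry on commas
split <- str_split(string = mydata$NAMES, pattern = ",")

# Trim whitespace, keep unique values, drop empty strings
vec <- unique(str_trim(unlist(split)))
remove <- ""
vec <- as.data.frame(vec[! vec %in% remove])
colnames(vec) <- "var"
vecRef <- as.vector(vec$var)

# Cast the unique values into column names
namesCast <- dcast(data = vec, formula = . ~ var)
namesCast <- namesCast[, 2:ncol(namesCast)]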

This yields a vector of unique NAMES with spaces/irregularities removed. From there I have no idea how to do the matching/dummy coding so any help would be greatly appreciated!

asked Dec 03 '14 15:12 by moku




1 Answer

You can use cSplit_e from my "splitstackshape" package, like this:

library(splitstackshape)
cSplit_e(mydata, "NAMES", sep = ",", type = "character", fill = 0)
#   ID          NAMES NAMES_333 NAMES_4444 NAMES_456 NAMES_765
# 1  1 4444, 333, 456         1          1         1         0
# 2  2            333         1          0         0         0
# 3  3       456, 765         0          0         1         1

If you want to see the underlying function that is called when you use those arguments, you can look at splitstackshape:::charMat, which takes a list generated by strsplit and creates a matrix from it.

Calling the function directly would give you something like this:

splitstackshape:::charMat(
  lapply(strsplit(as.character(mydata$NAMES), ","), 
         function(x) gsub("^\\s+|\\s+$", "", x)))
#      333 4444 456 765
# [1,]   1    1   1  NA
# [2,]   1   NA  NA  NA
# [3,]  NA   NA   1   1 
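
If you would rather not depend on the package, the same idea can be sketched in base R. This is not how splitstackshape implements it, only an illustration of the split-then-match approach. The names `pieces`, `lev`, `dummies` and `result` are made up for this example, and it fills with 0 rather than NA:

```r
# Example data from the question (NAMES assumed to be a character column).
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)

# Split on commas and trim surrounding whitespace from each piece.
pieces <- lapply(strsplit(as.character(mydata$NAMES), ","), trimws)

# Unique non-empty values, in order of first appearance.
lev <- unique(unlist(pieces))
lev <- lev[nzchar(lev)]

# One row per record, one column per unique value: 1 if present, else 0.
dummies <- do.call(rbind, lapply(pieces, function(x) as.integer(lev %in% x)))
colnames(dummies) <- lev

# cbind() on a data frame keeps the numeric-looking column names as they are.
result <- cbind(mydata, dummies)
result
```

Here the new columns are named after the values themselves (4444, 333, ...) instead of getting a NAMES_ prefix, which matches the layout asked for in the question. Because the names start with digits, refer to them with backticks or `[[`, for example `result[["456"]]`.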

answered Sep 23 '22 14:09 by A5C1D2H2I1M1N2O1R2T1