I'm having trouble working out how to dummy code the following dataset.
Example data (say the data frame is called mydata):
ID | NAMES |
-- | -------------- |
1 | 4444, 333, 456 |
2 | 333 |
3 | 456, 765 |
I'd like to cast only the unique values in NAMES as column variables and code whether each row has that value or not, i.e. 1 or 0.
Desired Output:
ID | NAMES | 4444 | 333 | 456 | 765 |
-- | -------------- |------|-----|-----|-----|
1 | 4444, 333, 456 | 1 | 1 | 1 | 0 |
2 | 333 | 0 | 1 | 0 | 0 |
3 | 456, 765 | 0 | 0 | 1 | 1 |
What I've done so far is create a vector of the unique NAMES values:
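library(stringr)   # str_split(), str_trim()
library(reshape2)  # dcast()
# Split each NAMES entry on commas and collect the unique, trimmed values
split <- str_split(string = mydata$NAMES, pattern = ",")
vec <- unique(str_trim(unlist(split)))
# Drop empty strings
remove <- ""
vec <- as.data.frame(vec[!vec %in% remove])
colnames(vec) <- "var"
vecRef <- as.vector(vec$var)
namesCast <- dcast(data = vec, formula = . ~ var)
namesCast <- namesCast[, 2:ncol(namesCast)]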
This yields a vector of unique NAMES with spaces/irregularities removed. From there I have no idea how to do the matching/dummy coding so any help would be greatly appreciated!
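For the matching step asked about above, here is a minimal base R sketch. It rebuilds the example data, and the names parts, vals, dummies and result are illustrative rather than taken from the question. It assumes every value is separated by a comma and that surrounding whitespace should be ignored.

```r
# Example data from the question
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)

# Split each row on commas and trim surrounding whitespace
parts <- lapply(strsplit(mydata$NAMES, ",", fixed = TRUE), trimws)

# Unique non-empty values, in order of first appearance
vals <- unique(unlist(parts))
vals <- vals[vals != ""]

# One row per record, one column per value: 1 if present, 0 otherwise
dummies <- do.call(rbind, lapply(parts, function(p) as.integer(vals %in% p)))
colnames(dummies) <- vals

# Attach the indicator columns to the original data
result <- cbind(mydata, dummies)
```

Because the values are matched with %in% against each row's own split pieces, a value such as 333 is not confused with a longer value that merely contains it.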
To convert categorical variables to dummy variables in the tidyverse, you can use the spread() function from tidyr. Its arguments include key, the column whose values become the new column names (in this case, “Reporting Airline”), and value, the column whose values fill those new columns (in this case, “dummy”).
To convert your categorical variables to dummy variables in Python, you can use the pandas get_dummies() function. For example, if you have the categorical variable “Gender” in a dataframe called df, the following code creates the dummy variables: df_dc = pd.get_dummies(df, columns=['Gender'])
7.1 Dummy Variables in R. R uses factor vectors to represent dummy or categorical data.
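As a small illustration of the factor point, base R's model.matrix() can expand a factor into indicator columns. This is only a sketch: the gender vector is made-up data, not part of the question.

```r
# Hypothetical categorical data, stored as a factor
gender <- factor(c("F", "M", "F"))

# model.matrix() expands a factor into indicator columns;
# "- 1" drops the intercept so every level gets its own column
dummies <- model.matrix(~ gender - 1)
```

The resulting columns are named after the variable and level (genderF, genderM). Note that this handles one category per row, so it does not by itself solve the comma-separated case in the question.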
You can use cSplit_e from my "splitstackshape" package, like this:
library(splitstackshape)
cSplit_e(mydata, "NAMES", sep = ",", type = "character", fill = 0)
# ID NAMES NAMES_333 NAMES_4444 NAMES_456 NAMES_765
# 1 1 4444, 333, 456 1 1 1 0
# 2 2 333 1 0 0 0
# 3 3 456, 765 0 0 1 1
If you want to see the underlying function that is called when you use those arguments, you can look at splitstackshape:::charMat, which takes a list generated by strsplit and creates a matrix from it.
Calling the function directly would give you something like this:
splitstackshape:::charMat(
lapply(strsplit(as.character(mydata$NAMES), ","),
function(x) gsub("^\\s+|\\s+$", "", x)))
# 333 4444 456 765
# [1,] 1 1 1 NA
# [2,] 1 NA NA NA
# [3,] NA NA 1 1
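The NA entries mark values that are absent from a row, which is what fill = 0 takes care of in the cSplit_e call. If you work with the matrix directly, you could recode them yourself. A minimal sketch, with the matrix typed in by hand from the output above so that it runs without the package (m and result are illustrative names):

```r
# Example data from the question
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)

# Matrix as printed above, entered column by column
m <- matrix(
  c(1, 1, NA,    # 333
    1, NA, NA,   # 4444
    1, NA, 1,    # 456
    NA, NA, 1),  # 765
  nrow = 3,
  dimnames = list(NULL, c("333", "4444", "456", "765"))
)

# Recode "absent" from NA to 0
m[is.na(m)] <- 0

# Attach the indicator columns to the original data
result <- cbind(mydata, m)
```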