Casting unique features in column to variable names and dummy coding original features into variables in R

Tags: r, dplyr, reshape, apply, plyr

I'm having an issue with how to dummy code the following dataset.

Example data (let's say the data frame is called mydata):

ID |     NAMES      |
-- | -------------- |
1  | 4444, 333, 456 |
2  | 333            |
3  | 456, 765       |
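
For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: it assumes NAMES is stored as a plain character column of comma-separated values, which the table itself does not confirm.

```r
# Hypothetical reconstruction of the example data shown above.
# Assumption: NAMES is a character column, not a factor.
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)
```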

I'd like to cast only the unique values in NAMES as column variables and code whether each row has that value or not, i.e., 1 or 0.

Desired Output:

ID |     NAMES      | 4444 | 333 | 456 | 765 |
-- | -------------- |------|-----|-----|-----|
1  | 4444, 333, 456 |   1  |  1  |  1  |   0 |
2  | 333            |   0  |  1  |  0  |   0 |
3  | 456, 765       |   0  |  0  |  1  |   1 |

What I've done so far is create a vector of the unique values:

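library(stringr)   # str_split(), str_trim()
library(reshape2)  # dcast()

# Split each NAMES entry on commas
split <- str_split(string = mydata$NAMES, pattern = ",")

# Trim whitespace, keep the unique values, and drop empty strings
vec <- unique(str_trim(unlist(split)))
remove <- ""
vec <- as.data.frame(vec[!vec %in% remove])
colnames(vec) <- "var"
vecRef <- as.vector(vec$var)

# Cast the unique values into column names, then drop the placeholder "." column
namesCast <- dcast(data = vec, formula = . ~ var)
namesCast <- namesCast[, 2:ncol(namesCast)]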

This yields a vector of unique NAMES with spaces/irregularities removed. From there I have no idea how to do the matching/dummy coding, so any help would be greatly appreciated!

asked Dec 03 '14 15:12

moku

1 Answer

You can use cSplit_e from my "splitstackshape" package, like this:

library(splitstackshape)
cSplit_e(mydata, "NAMES", sep = ",", type = "character", fill = 0)
#   ID          NAMES NAMES_333 NAMES_4444 NAMES_456 NAMES_765
# 1  1 4444, 333, 456         1          1         1         0
# 2  2            333         1          0         0         0
# 3  3       456, 765         0          0         1         1

If you want to see the underlying function that is called when you use those arguments, you can look at splitstackshape:::charMat, which takes a list generated by strsplit and creates a matrix from it.

Calling the function directly would give you something like this:

splitstackshape:::charMat(
  lapply(strsplit(as.character(mydata$NAMES), ","),
         function(x) gsub("^\\s+|\\s+$", "", x)))
#      333 4444 456 765
# [1,]   1    1   1  NA
# [2,]   1   NA  NA  NA
# [3,]  NA   NA   1   1
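
If you would rather not depend on a package, the same idea (split, trim, then mark which values are present in each row) can be sketched in base R. This is not how charMat is implemented internally; it is only an illustration, and the names pieces, lvls, dummies and result are made up for the example. The data frame is rebuilt here, assuming NAMES is a character column, so the snippet runs on its own.

```r
# Example data as in the question (NAMES assumed to be character).
mydata <- data.frame(
  ID = 1:3,
  NAMES = c("4444, 333, 456", "333", "456, 765"),
  stringsAsFactors = FALSE
)

# Split on commas and trim surrounding whitespace from every piece.
pieces <- lapply(strsplit(mydata$NAMES, ",", fixed = TRUE), trimws)

# Unique non-empty values, in order of first appearance.
lvls <- unique(unlist(pieces))
lvls <- lvls[nzchar(lvls)]

# One row per record, one column per value: 1 if present, 0 otherwise.
dummies <- do.call(rbind, lapply(pieces, function(p) as.integer(lvls %in% p)))
colnames(dummies) <- lvls

# Bind the indicator columns onto the original data frame.
result <- cbind(mydata, dummies)
```

Here the new columns are named after the raw values ("4444", "333", ...) rather than getting the NAMES_ prefix that cSplit_e adds, and absent values are coded as 0 instead of NA.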

answered Sep 23 '22 14:09

A5C1D2H2I1M1N2O1R2T1
