I'm interested in taking a column of a data.frame where the values in the column are pipe delimited and creating dummy variables from the pipe-delimited values.
For example:
Let's say we start with
df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))
> df
              a
1 Ben|Chris|Jim
2  Ben|Greg|Jim
3 Jim|Steve|Ben
I'm interested in ending up with:
df2 = data.frame(Ben = c(1, 1, 1), Chris = c(1, 0, 0), Jim = c(1, 1, 1), Greg = c(0, 1, 0),
                 Steve = c(0, 0, 1))
> df2
  Ben Chris Jim Greg Steve
1   1     1   1    0     0
2   1     0   1    1     0
3   1     0   1    0     1
I don't know in advance how many potential values there are within the field. In the example above, the variable "a" can include 1 value or 10 values. Assume it is a reasonable number (i.e., < 100 possible values).
Any good ways to do this?
Another way is to use cSplit_e from the splitstackshape package: split the data frame on column a, fill the missing entries with 0, and drop the original column.
library(splitstackshape)
cSplit_e(df, "a", "|", type = "character", fill = 0, drop = TRUE)
#   a_Ben a_Chris a_Greg a_Jim a_Steve
# 1     1       1      0     1       0
# 2     1       0      1     1       0
# 3     1       0      0     1       1
Here is a method in base R
# get unique set of names
myNames <- unique(unlist(strsplit(as.character(df$a), split="\\|")))
# get indicator data.frame
setNames(data.frame(lapply(myNames, function(i) as.integer(grepl(i, df$a)))), myNames)
which returns
  Ben Chris Jim Greg Steve
1   1     1   1    0     0
2   1     0   1    1     0
3   1     0   1    0     1
The first line uses strsplit to produce a list of names split on the pipe "|"; unlist and unique then reduce it to a vector of unique names. The second line runs through these names with lapply and uses grepl to search for each name in df$a, and as.integer converts the logical result into binary integers. The returned list is converted into a data.frame and given column names with setNames.
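One caveat: grepl matches substrings, so a name such as "Ben" would also flag a row that contains "Benjamin". A minimal sketch of an exact-match variant in base R follows. The helper name make_dummies is hypothetical, and the data is assumed to be the same df as in the question.

```r
# Hypothetical helper: build 0/1 indicator columns by matching whole tokens
make_dummies <- function(x, sep = "|") {
  # split each string on the literal separator (no regex escaping needed)
  tokens <- strsplit(as.character(x), sep, fixed = TRUE)
  # unique names, in order of first appearance
  levs <- unique(unlist(tokens))
  # one row per input string, one column per name; %in% compares whole tokens
  m <- do.call(rbind, lapply(tokens, function(tk) as.integer(levs %in% tk)))
  colnames(m) <- levs
  as.data.frame(m)
}

df <- data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))
make_dummies(df$a)
#   Ben Chris Jim Greg Steve
# 1   1     1   1    0     0
# 2   1     0   1    1     0
# 3   1     0   1    0     1
```

Because the comparison is done on the split tokens rather than on the raw string, names that are prefixes of other names stay separate.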
Here is one option using dplyr and tidyr:
library(dplyr)
library(tidyr)
df %>% tibble::rownames_to_column(var = "id") %>%
  mutate(a = strsplit(as.character(a), "\\|")) %>%
  unnest(a) %>% table()
#    a
# id  Ben Chris Greg Jim Steve
#   1   1     1    0   1     0
#   2   1     0    1   1     0
#   3   1     0    0   1     1
The analogue in base R is:
df$a <- as.character(df$a)
s <- strsplit(df$a, "|", fixed=TRUE)
table(id = rep(1:nrow(df), lengths(s)), v = unlist(s))
Data:
df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))
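Note that table() returns a contingency table rather than a data.frame. If a data frame of 0/1 columns is needed, as in the question, the base R version can be converted along these lines. This is a sketch under the assumption that the data is the df shown above; the variable names tab and dummies are only illustrative.

```r
df <- data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"),
                 stringsAsFactors = FALSE)
s <- strsplit(df$a, "|", fixed = TRUE)
tab <- table(id = rep(seq_len(nrow(df)), lengths(s)), v = unlist(s))

# as.data.frame() on a table gives long format;
# as.data.frame.matrix() keeps the wide id-by-name layout
dummies <- as.data.frame.matrix(tab)

# cap counts at 1 in case a name appears twice in the same row
dummies[] <- lapply(dummies, function(col) as.integer(col > 0))
dummies
```

The columns come out in alphabetical order (Ben, Chris, Greg, Jim, Steve) because table() sorts the names.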