
Convert column with pipe delimited data into dummy variables [duplicate]

Tags: r, delimiter

I'm interested in taking a column of a data.frame where the values in the column are pipe delimited and creating dummy variables from the pipe-delimited values.

For example:

Let's say we start with

df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))

> df
              a
1 Ben|Chris|Jim
2 Ben|Greg|Jim
3 Jim|Steve|Ben

I'm interested in ending up with:

df2 = data.frame(Ben = c(1, 1, 1), Chris = c(1, 0, 0), Jim = c(1, 1, 1), Greg = c(0, 1, 0), 
                 Steve = c(0, 0, 1))
> df2
  Ben Chris Jim Greg Steve
1   1     1   1    0     0
2   1     0   1    1     0
3   1     0   1    0     1

I don't know in advance how many potential values there are within the field. In the example above, the variable "a" can include 1 value or 10 values. Assume it is a reasonable number (i.e., < 100 possible values).

Any good ways to do this?

dreww2 asked Sep 13 '16 02:09


3 Answers

Another way is to use cSplit_e from the splitstackshape package.

It splits column a on the pipe, fills the missing entries with 0, and drops the original column.

library(splitstackshape)
cSplit_e(df, "a", "|", type = "character", fill = 0, drop = TRUE)

#   a_Ben a_Chris a_Greg a_Jim a_Steve
#1     1       1      0     1       0
#2     1       0      1     1       0
#3     1       0      0     1       1

Ronak Shah answered Oct 12 '22 16:10


Here is a method in base R

# get unique set of names
myNames <- unique(unlist(strsplit(as.character(df$a), split="\\|")))
# get indicator data.frame
setNames(data.frame(lapply(myNames, function(i) as.integer(grepl(i, df$a)))), myNames)

which returns

  Ben Chris Jim Greg Steve
1   1     1   1    0     0
2   1     0   1    1     0
3   1     0   1    0     1

The first line uses strsplit to split each string on the pipe "|"; unlist and unique then reduce the result to a vector of unique names. The second line runs through these names with lapply and uses grepl to search for each name in the column; as.integer converts the logical results into 0/1 integers. The returned list is converted into a data.frame and given column names with setNames.
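
One caveat: grepl treats each name as a regular expression and matches substrings, so a name like "Ben" would also flag a row that contains "Benjamin". Below is a minimal sketch of an exact-match variant, assuming the same df as in the question. The names tokens, indicators and result are only illustrative and are not part of the original answer.

```r
df <- data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))

# split each row into its tokens once; fixed = TRUE treats "|" literally
tokens  <- strsplit(as.character(df$a), "|", fixed = TRUE)
myNames <- unique(unlist(tokens))

# for each name, test whole-token membership in every row
indicators <- lapply(myNames, function(nm) {
  as.integer(vapply(tokens, function(tk) nm %in% tk, logical(1)))
})

# assemble the indicator columns and label them with the names
result <- setNames(data.frame(indicators), myNames)
result
```

Because the comparison is done with %in% on the split tokens, names that contain regex metacharacters or that are prefixes of other names should be handled without extra escaping.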

lmo answered Oct 12 '22 14:10


Here is one option using dplyr and tidyr:

library(dplyr)
library(tidyr)
df %>% tibble::rownames_to_column(var = "id") %>%
       mutate(a = strsplit(as.character(a), "\\|")) %>%
       # name the list column explicitly; newer tidyr versions warn on a bare unnest()
       unnest(a) %>% table()

#    a
# id  Ben Chris Greg Jim Steve
#  1   1     1    0   1     0
#  2   1     0    1   1     0
#  3   1     0    0   1     1

The analogue in base R is:

df$a <- as.character(df$a)
s    <- strsplit(df$a, "|", fixed=TRUE)
table(id = rep(1:nrow(df), lengths(s)), v = unlist(s))
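
Both versions return a table rather than a data.frame. If a data.frame like df2 in the question is needed, one possible conversion is sketched below, assuming the same df. The names tab and df2 are illustrative, and the cap at 1 is an added assumption for the case where a name is repeated within a row.

```r
df <- data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))

s   <- strsplit(as.character(df$a), "|", fixed = TRUE)
tab <- table(id = rep(seq_len(nrow(df)), lengths(s)), v = unlist(s))

# as.data.frame.matrix keeps the wide layout (one column per name);
# plain as.data.frame() would return long-format counts instead
df2 <- as.data.frame.matrix(tab)

# a name repeated within one row would give a count above 1, so cap at 1
df2[] <- lapply(df2, function(x) as.integer(x > 0))
df2
```

The columns come out in alphabetical order (Ben, Chris, Greg, Jim, Steve) because table sorts the factor levels.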

Data:

df = data.frame(a = c("Ben|Chris|Jim", "Ben|Greg|Jim", "Jim|Steve|Ben"))

Psidom answered Oct 12 '22 14:10