Converting Column Values into Their Own Binary Encoded Columns (Dummy Variables)

Tags: r, sparse-matrix, reshape2

I have a number of CSV files with columns such as gender, age, diagnosis, etc.

Currently, they are coded as such:

ID, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd

The goal is to transform this data into this target format:

ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1,  1,    0,      1,  0,  0,  0,  1,      1,       0,        0
2,  1,    0,      0,  1,  0,  0,  1,      0,       0,        0
3,  0,    1,      0,  0,  1,  0,  0,      0,       1,        0
4,  0,    1,      0,  0,  0,  1,  0,      0,       1,        1

Sidenote: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female'.

I've tried reshape2's dcast() function, but it creates one column for every combination of values, which results in extremely sparse matrices. Here's a simplified example with just age and gender:

data.train <- dcast(data.raw, formula = ID ~ gender + age, fun.aggregate = length)

ID, male19, male23, male42, male61, female19, female23, female42, female61
1,  0,      0,      1,      0,      0,        0,        0,        0
2,  1,      0,      0,      0,      0,        0,        0,        0
3,  0,      0,      0,      0,      0,        1,        0,        0
4,  0,      0,      0,      0,      0,        0,        0,        1

Seeing as this is a fairly common task in machine learning data preparation, I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.

asked May 16 '15 20:05

Greenstick

4 Answers

You need a melt/dcast combination here (which is called recast) to first gather all columns into a single value column, so that each distinct value gets its own column instead of each combination of values:

library(reshape2)
recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1  1  0  0  1  0       1      1    0        0      0    1
# 2  2  1  0  0  0       0      1    0        0      0    1
# 3  3  0  1  0  0       0      0    0        1      1    0
# 4  4  0  0  0  1       0      0    1        1      1    0

As per your sidenote, you can add variable to the formula to prepend the original column names:

recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd
# 1  1             0           1      0      0      1      0                 1                1              0
# 2  2             0           1      1      0      0      0                 0                1              0
# 3  3             1           0      0      1      0      0                 0                0              0
# 4  4             1           0      0      0      0      1                 0                0              1
#   diagnosis_diabetes
# 1                  0
# 2                  0
# 3                  1
# 4                  1
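
For readers who want to see what recast() combines, the two steps can be written out separately. This is a minimal sketch, assuming the question's data sits in a data frame named df with character (not factor) columns; the intermediate names long and wide are illustrative and not part of the answer above.

```r
library(reshape2)

# Sample data from the question (df is an assumed name)
df <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Step 1: melt to long format, one row per (ID, original column, value)
long <- melt(df, id.vars = "ID")

# Step 2: cast back to wide format, one column per variable/value pair;
# the aggregate returns 1 if the pair occurs for that ID and 0 otherwise
wide <- dcast(
  long,
  ID ~ variable + value,
  value.var = "value",
  fun.aggregate = function(x) as.integer(length(x) > 0)
)
```

Writing the steps out makes it easier to inspect long before casting, for example to filter out values that should not become columns.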

146

answered Oct 24 '22 09:10

David Arenburg

There is a function in the caret package to "dummify" data.

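library(caret)
library(dplyr)
# mutate_each() and funs() are deprecated in current dplyr; across() replaces them
df_f <- mutate(df, across(everything(), as.factor))
predict(dummyVars(~ ., data = df_f), newdata = df_f)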
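
Dummy coding of this kind works row by row, so with repeated IDs the result has one row per input row rather than one row per ID. As a hedged sketch of how a per-row 0/1 matrix could be collapsed to the target format, the following uses base R's model.matrix() and rowsum() in place of caret (a swapped-in technique, not this answer's method). The names preds, blocks, per_row and per_id are hypothetical, and df is assumed to hold the question's data.

```r
# Sample data from the question (df is an assumed name)
df <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Treat every non-ID column as categorical
preds <- lapply(df[-1], factor)

# Build one indicator block per column, keeping all levels (no reference level dropped)
blocks <- lapply(names(preds), function(nm) {
  m <- model.matrix(~ f - 1, data = data.frame(f = preds[[nm]]))
  colnames(m) <- paste(nm, levels(preds[[nm]]), sep = "_")
  m
})
per_row <- do.call(cbind, blocks)  # one row per input row

# Sum the indicators within each ID, then reduce counts to 0/1
per_id <- (rowsum(per_row, group = df$ID) > 0) * 1L
```

Here per_id is a matrix whose row names are the IDs; as.data.frame() can convert it if a data frame is needed downstream.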

answered Oct 24 '22 10:10

Steven Beaupré

A base R option would be

 (!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
 #     values
 #ID  19 23 42 61 anxiety asthma copd diabetes female male
 # 1  0  0  1  0       1      1    0        0      0    1
 # 2  1  0  0  0       0      1    0        0      0    1
 # 3  0  1  0  0       0      0    0        1      1    0
 # 4  0  0  0  1       0      0    1        1      1    0

If you need the original name as well

 (!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
 #   Val
 #ID  age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
 #1      0      0      1      0                 1                1
 #2      1      0      0      0                 0                1
 #3      0      1      0      0                 0                0
 #4      0      0      0      1                 0                0
 #  Val
 #ID  diagnosis_copd diagnosis_diabetes gender_female gender_male
 #1              0                  0             0           1
 #2              0                  0             0           1
 #3              0                  1             1           0
 #4              1                  1             1           0

data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male", 
"male", "male", "female", "female", "female"), age = c(42L, 42L, 
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma", 
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender", 
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")

answered Oct 24 '22 10:10

akrun

Using reshape from base R:

d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")


new.dat <- cbind(ID=d["ID"],
    g[,grepl("_", names(g))],
    a[,grepl("_", names(a))],
    d[,grepl("_", names(d))])

# convert factors columns to character (if necessary)
# taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)

new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1

#  ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
#1  1           1             0      1      0      0      0                1
#3  2           1             0      0      1      0      0                1
#4  3           0             1      0      0      1      0                0
#5  4           0             1      0      0      0      1                0
#  diagnosis_anxiety diagnosis_diabetes diagnosis_copd
#1                 1                  0              0
#3                 0                  0              0
#4                 0                  1              0
#5                 0                  1              1

answered Oct 24 '22 10:10

Jota
