I have a number of CSV files with columns such as gender, age, diagnosis, etc.
Currently, they are coded as such:
ID, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd
The goal is to transform this data into this target format:
ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0
2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0
3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0
4, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1
Sidenote: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female'.
I've attempted using reshape2's dcast() function but am getting combinations resulting in extremely sparse matrices. Here's a simplified example with just age and gender:
data.train <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)
ID, male19, male23, male42, male61, female19, female23, female42, female61
1, 0, 0, 1, 0, 0, 0, 0, 0
2, 1, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 1, 0, 0
4, 0, 0, 0, 0, 0, 0, 0, 1
Seeing as this is a fairly common task in machine learning data preparation, I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.
You need a melt/dcast combination here (which is called recast) in order to convert all columns into one column and avoid combinations:
library(reshape2)
recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1  1  0  0  1  0       1      1    0        0      0    1
# 2  2  1  0  0  0       0      1    0        0      0    1
# 3  3  0  1  0  0       0      0    0        1      1    0
# 4  4  0  0  0  1       0      0    1        1      1    0
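To make the melt/dcast combination more concrete, here is a rough two-step sketch of what recast is doing in a single call. The sample data frame is rebuilt from the question, and the names long and wide are just illustrative; the aggregation function is written with as.integer() but is meant to behave like the one above.

```r
library(reshape2)

# Sample data rebuilt from the question
df <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Step 1: stack gender, age and diagnosis into a single "value" column,
# keeping ID as the identifier ("variable" records the source column)
long <- melt(df, id.vars = "ID")

# Step 2: spread the distinct values into columns; a cell is 1 if that ID
# has the value at least once and 0 otherwise
wide <- dcast(long, ID ~ value, value.var = "value",
              fun.aggregate = function(x) as.integer(length(x) > 0))
```

Because every original column ends up in the same value column before casting, dcast never has to cross gender with age, which is where the sparse combinations in the question came from.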
As per your Sidenote, you can add variable to the formula in order to get the original column names added too:
recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd
# 1  1             0           1      0      0      1      0                 1                1              0
# 2  2             0           1      1      0      0      0                 0                1              0
# 3  3             1           0      0      1      0      0                 0                0              0
# 4  4             1           0      0      0      0      1                 0                0              1
#   diagnosis_diabetes
# 1                  0
# 2                  0
# 3                  1
# 4                  1
There is a function in the caret package to "dummify" data.
library(caret)
library(dplyr)
# mutate_each() and funs() are deprecated in current dplyr; across() is the replacement
predict(dummyVars(~ ., data = mutate(df, across(everything(), as.factor))), newdata = df)
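Note that predict() on a dummyVars object returns one row per input row, so IDs that appear more than once (1 and 4 here) still take up two rows. A possible way to collapse that to one row per ID is sketched below; it assumes the sample data from the question, keeps ID out of the formula, and takes the per-ID maximum of each dummy column. The names df_f, dummies and per_id are only illustrative.

```r
library(caret)

# Sample data rebuilt from the question
df <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Treat every non-ID column as categorical
df_f <- df
df_f[-1] <- lapply(df_f[-1], as.factor)

# One 0/1 column per factor level, still one row per original row
dummies <- predict(dummyVars(~ gender + age + diagnosis, data = df_f),
                   newdata = df_f)

# Collapse to one row per ID: a dummy is 1 if any of that ID's rows had it
per_id <- aggregate(as.data.frame(dummies), by = list(ID = df_f$ID), FUN = max)
```

Taking the maximum rather than the sum keeps the result binary even when the same value repeats across an ID's rows.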
A base R option would be
(!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
#    values
# ID 19 23 42 61 anxiety asthma copd diabetes female male
#  1  0  0  1  0       1      1    0        0      0    1
#  2  1  0  0  0       0      1    0        0      0    1
#  3  0  1  0  0       0      0    0        1      1    0
#  4  0  0  0  1       0      0    1        1      1    0
If you need the original name as well
(!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
#    Val
# ID age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
#  1      0      0      1      0                 1                1
#  2      1      0      0      0                 0                1
#  3      0      1      0      0                 0                0
#  4      0      0      0      1                 0                0
#    Val
# ID diagnosis_copd diagnosis_diabetes gender_female gender_male
#  1              0                  0             0           1
#  2              0                  0             0           1
#  3              0                  1             1           0
#  4              1                  1             1           0
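The one-liners above are dense, so here is a step-by-step sketch of the same idea. It assumes the non-ID columns are character or integer vectors (not factors), and the names stacked, long, counts and onehot are only illustrative.

```r
# Sample data rebuilt from the question
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 3L, 4L, 4L),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42L, 42L, 19L, 23L, 61L, 61L),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Stack every non-ID column: "values" holds the cell contents (as character),
# "ind" holds the name of the column each value came from
stacked <- stack(df1[-1])

# stack() works column by column, so repeat the IDs once per stacked column
long <- data.frame(
  ID = rep(df1$ID, times = ncol(df1) - 1),
  Val = paste(stacked$ind, stacked$values, sep = "_")
)

# Cross-tabulate ID against label; counts can exceed 1 because
# ID 1 contributes "gender_male" from two rows
counts <- table(long)

# Clamp the counts to 0/1, which is the job of the !! ... * 1L idiom
onehot <- (counts > 0) * 1L
```

The clamping step matters because table() counts occurrences, and IDs with several diagnoses repeat their gender and age on every row.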
Data:
df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male",
"male", "male", "female", "female", "female"), age = c(42L, 42L,
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma",
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender",
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")
Using reshape from base R:
d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")
new.dat <- cbind(ID=d["ID"],
                 g[,grepl("_", names(g))],
                 a[,grepl("_", names(a))],
                 d[,grepl("_", names(d))])
# convert factors columns to character (if necessary)
# taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)
new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1
#  ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
#1  1           1             0      1      0      0      0                1
#3  2           1             0      0      1      0      0                1
#4  3           0             1      0      0      1      0                0
#5  4           0             1      0      0      0      1                0
#  diagnosis_anxiety diagnosis_diabetes diagnosis_copd
#1                 1                  0              0
#3                 0                  0              0
#4                 0                  1              0
#5                 0                  1              1