I'm working on a prediction problem and building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently across my training and testing sets. I managed to do it on my training data with:

temps <- X_train
tt <- subset(temps, select = -output)
oh <- data.frame(model.matrix(~ . - 1, tt), CLASS = temps$output)
But I can't find a way to apply the same encoding on my testing set, how can I do that?
When working with categorical variables, you can use the group_by() function to divide the data into subgroups based on the variable's distinct categories. You can group by a single variable, or pass multiple variable names to group by several variables.
Target encoding is also very simple: each value of a categorical variable is encoded as the mean of the target variable over the rows that share that value. The per-category mean can be computed with the aggregate() function in R. Some implementations also let you add noise to the encoded values via a sigma argument.
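As a minimal sketch of the idea described above (the data frame and column names here are made up for illustration), the per-category mean can be computed with aggregate() and merged back onto the rows:

```r
# Toy data: categorical predictor `cat` and numeric target `y`
df <- data.frame(cat = c("a", "a", "b", "b", "b"),
                 y   = c(1, 0, 1, 1, 0))

# Mean of the target per category, via aggregate()
enc <- aggregate(y ~ cat, data = df, FUN = mean)
names(enc)[2] <- "cat_encoded"

# Attach the encoded value to every row of the original data
df <- merge(df, enc, by = "cat")
df
# Optionally, noise could be added here, e.g. rnorm(nrow(df), sd = sigma)
```

Note that for a train/test split, the means should be computed on the training rows only and then merged onto both sets.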
To convert categorical variables to dummy variables in the tidyverse, use the spread() function with three arguments: key, which is the column to convert into categorical values, in this case "Reporting Airline"; value, which is the value you want to set the key to (in this case "dummy"); and fill, which fills the combinations that do not occur (typically with 0).
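A sketch of that recipe on a toy stand-in for the airline data (the column name and values here are assumptions; note also that spread() has since been superseded by pivot_wider() in tidyr):

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the flight data described above
flights <- data.frame(id = 1:4,
                      Reporting_Airline = c("AA", "UA", "AA", "DL"))

# Add a constant `dummy` column, then spread it out by airline:
# one 0/1 column per distinct airline, 0 where the combination is absent
encoded <- flights %>%
  mutate(dummy = 1) %>%
  spread(key = Reporting_Airline, value = dummy, fill = 0)
encoded
```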
This means that if your data contains categorical variables, you must encode them as numbers before you can fit and evaluate a model. The two most popular techniques are integer encoding and one-hot encoding, although a newer technique, learned embeddings, may provide a useful middle ground between the two.
R has “one-hot” encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water: they can’t point to it, as it is everywhere. For example, we can see evidence of one-hot encoding in the variable names chosen by a linear regression:
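A small illustration of the point above, using the built-in iris data: fit a linear model on a factor and the dummy columns R created show up in the coefficient names.

```r
# Fit a linear model with a factor predictor; R expands the factor
# into dummy columns behind the scenes, visible in the coefficient names
fit <- lm(Sepal.Length ~ Species, data = iris)
names(coef(fit))
# "(Intercept)" "Speciesversicolor" "Speciesvirginica"
```

The reference level (setosa) is absorbed into the intercept, which is why only two of the three species appear.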
So if you have 27 distinct values of a categorical variable, then 5 columns are sufficient to encode this variable - as 5-digit binary numbers can store any value from 0 to 31. An implementation is provided below using the binaryLogic package.
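The text above mentions the binaryLogic package; here is a base-R sketch of the same idea that avoids the dependency (the function name binary_encode is my own). Each level's 0-based integer code is written out as 5 bits, one column per bit:

```r
# Binary-encode a categorical vector: n_bits columns of 0/1,
# enough for up to 2^n_bits distinct levels (5 bits -> 32 levels)
binary_encode <- function(x, n_bits = 5) {
  codes <- as.integer(factor(x)) - 1L        # 0-based integer codes
  bits <- sapply(seq_len(n_bits) - 1L,
                 function(b) bitwAnd(bitwShiftR(codes, b), 1L))
  colnames(bits) <- paste0("bit", seq_len(n_bits) - 1L)
  bits
}

binary_encode(c("a", "b", "c"))
# 3 rows x 5 columns; "b" (code 1) sets bit0, "c" (code 2) sets bit1
```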
One-hot encoding can also be applied on top of an integer encoding: the integer-encoded variable is removed and a new binary variable is added for each unique integer value. For instance, a variable of colors with three distinct values becomes three binary columns.
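A quick sketch of that with a small colors vector (the data here is made up):

```r
# Three distinct colors -> three 0/1 columns, exactly one 1 per row
colors <- factor(c("red", "green", "blue", "green"))
model.matrix(~ 0 + colors)
```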
I recommend using the dummyVars function in the caret package:
library(caret)

customers <- data.frame(
  id      = c(10, 20, 30, 40, 50),
  gender  = c('male', 'female', 'female', 'male', 'female'),
  mood    = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0))
customers

  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0

# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
trsf

  id gender.female gender.male mood.happy mood.sad outcome
1 10             0           1          1        0       1
2 20             1           0          0        1       1
3 30             1           0          1        0       0
4 40             0           1          0        1       0
5 50             1           0          1        0       0
You apply the same procedure to both the training and validation sets.
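Concretely, the dummyVars object is fitted once on the training data and then reused for both splits, so the column set and order are guaranteed to match. A sketch using iris as a stand-in for the asker's X_train / X_test:

```r
library(caret)

# Split iris into train and test sets
set.seed(1)
idx   <- sample(seq_len(nrow(iris)), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit the encoder on the training data only...
dmy <- dummyVars(~ ., data = train)

# ...then apply the SAME encoder to both splits
train_oh <- data.frame(predict(dmy, newdata = train))
test_oh  <- data.frame(predict(dmy, newdata = test))
```

Both frames now have identical column names, e.g. Species.setosa, Species.versicolor, Species.virginica.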
Here's a simple solution to one-hot encode your categorical variable using no packages.
model.matrix(~0+category)
It needs your categorical variable to be a factor. The factor levels must be the same in your training and test data; check with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set.
Here's an example using the iris dataset.
data(iris)

# Split into train and test sets.
train <- sample(1:nrow(iris), 100)
test <- -1 * train
iris[test,]

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
34           5.5         4.2          1.4         0.2    setosa
106          7.6         3.0          6.6         2.1 virginica
112          6.4         2.7          5.3         1.9 virginica
127          6.2         2.8          4.8         1.8 virginica
132          7.9         3.8          6.4         2.0 virginica
model.matrix() creates a column for each level of the factor, even if it is not present in the data. Zero indicates it is not that level; one indicates it is. Adding the zero specifies that you do not want an intercept or reference level, and is equivalent to -1.
oh_train <- model.matrix(~0 + iris[train, 'Species'])
oh_test <- model.matrix(~0 + iris[test, 'Species'])

# Renaming the columns to be more concise.
attr(oh_test, "dimnames")[[2]] <- levels(iris$Species)

  setosa versicolor virginica
1      1          0         0
2      0          0         1
3      0          0         1
4      0          0         1
5      0          0         1
P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.