How to one hot encode several categorical variables in R

I'm working on a prediction problem and building a decision tree in R. I have several categorical variables that I'd like to one-hot encode consistently in my training and testing sets. I managed to do it on my training data with:

    temps <- X_train
    tt <- subset(temps, select = -output)
    oh <- data.frame(model.matrix(~ . -1, tt), CLASS = temps$output)

But I can't find a way to apply the same encoding to my testing set. How can I do that?

xeco asked Feb 06 '18 18:02


2 Answers

I recommend using the dummyVars function in the caret package:

    library(caret)

    customers <- data.frame(
      id = c(10, 20, 30, 40, 50),
      gender = c('male', 'female', 'female', 'male', 'female'),
      mood = c('happy', 'sad', 'happy', 'sad', 'happy'),
      outcome = c(1, 1, 0, 0, 0),
      # store the text columns as factors (R >= 4.0 no longer does this by default)
      stringsAsFactors = TRUE)
    customers
      id gender  mood outcome
    1 10   male happy       1
    2 20 female   sad       1
    3 30 female happy       0
    4 40   male   sad       0
    5 50 female happy       0

    # dummify the data
    dmy <- dummyVars(" ~ .", data = customers)
    trsf <- data.frame(predict(dmy, newdata = customers))
    trsf
      id gender.female gender.male mood.happy mood.sad outcome
    1 10             0           1          1        0       1
    2 20             1           0          0        1       1
    3 30             1           0          1        0       0
    4 40             0           1          0        1       0
    5 50             1           0          1        0       0

example source

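You apply the same procedure to both the training and validation sets: build the dummyVars object once on the training data, then call predict() with that same object on each set, so both end up with identical columns.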
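As a minimal sketch of reusing one encoder on both sets (the `train` and `test` data frames below are made up for illustration, and the formula is restricted to the predictors so the test set does not need an outcome column):

```r
library(caret)

# Hypothetical training and test data; the test set lacks the "male" level.
train <- data.frame(
  gender  = factor(c("male", "female", "female")),
  mood    = factor(c("happy", "sad", "happy")),
  outcome = c(1, 1, 0))
test <- data.frame(
  gender = factor(c("female", "female"), levels = levels(train$gender)),
  mood   = factor(c("sad", "happy"),     levels = levels(train$mood)))

# Fit the encoder on the training predictors only.
dmy <- dummyVars(~ gender + mood, data = train)

# Apply the same fitted encoder to each set.
oh_train <- data.frame(predict(dmy, newdata = train))
oh_test  <- data.frame(predict(dmy, newdata = test))

# Both encoded sets should expose the same columns.
identical(colnames(oh_train), colnames(oh_test))
```

The point of the design is that the levels are recorded once, from the training data, rather than being rediscovered separately in each set.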

Esteban PS answered Oct 10 '22 10:10


Here's a simple solution to one-hot-encode your category using no packages.

Solution

model.matrix(~0+category)

It needs your categorical variable to be a factor, and the factor levels must be the same in your training and test data; check with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set, as long as they are still declared as levels of the factor.
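If the two sets were read in separately and their levels differ, one way to align them is to take the levels from the training data and reuse them for the test data. This is only a sketch: the `color` column and both data frames are hypothetical.

```r
# Hypothetical data: "red" and "blue" never occur in the test set.
train <- data.frame(color = c("red", "green", "blue", "red"),
                    stringsAsFactors = FALSE)
test  <- data.frame(color = c("green", "green"),
                    stringsAsFactors = FALSE)

# Take the level set from the training data and impose it on both sets.
lv <- sort(unique(train$color))
train$color <- factor(train$color, levels = lv)
test$color  <- factor(test$color,  levels = lv)

# Same formula, same levels, hence the same columns in the same order.
oh_train <- model.matrix(~ 0 + color, data = train)
oh_test  <- model.matrix(~ 0 + color, data = test)

stopifnot(identical(colnames(oh_train), colnames(oh_test)))
```

Note that a test value that is not among the training levels becomes NA under this approach, and model.matrix() then drops that row under the default NA handling, so it is worth checking for NAs after the conversion.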

Example

Here's an example using the iris dataset.

    data(iris)
    # Split into train and test sets.
    train <- sample(1:nrow(iris), 100)
    test <- -1 * train

    iris[test, ]
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    34           5.5         4.2          1.4         0.2    setosa
    106          7.6         3.0          6.6         2.1 virginica
    112          6.4         2.7          5.3         1.9 virginica
    127          6.2         2.8          4.8         1.8 virginica
    132          7.9         3.8          6.4         2.0 virginica

model.matrix() creates a column for each level of the factor, even if it is not present in the data. Zero indicates it is not that level, one indicates it is. Adding the zero specifies that you do not want an intercept or reference level and is equivalent to -1.

    oh_train <- model.matrix(~0 + iris[train, 'Species'])
    oh_test <- model.matrix(~0 + iris[test, 'Species'])

    # Renaming the columns to be more concise.
    attr(oh_test, "dimnames")[[2]] <- levels(iris$Species)
    oh_test
      setosa versicolor virginica
    1      1          0         0
    2      0          0         1
    3      0          0         1
    4      0          0         1
    5      0          0         1
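The points about unused levels and the intercept can be checked on a toy factor. This is a small sketch with a made-up factor `f` whose level "c" never occurs in the data:

```r
# Hypothetical factor with a declared but unused level "c".
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

with_zero      <- model.matrix(~ 0 + f)  # no intercept
minus_one      <- model.matrix(~ f - 1)  # same request, written with -1
with_intercept <- model.matrix(~ f)      # default: intercept plus reference level

colnames(with_zero)       # one column per level, including the unused "c"
colnames(with_intercept)  # the first level is absorbed into the intercept
sum(with_zero[, "fc"])    # the unused level's column contains only zeros
```

Keeping the all-zero column is what makes the train and test matrices line up even when a level is missing from one of them.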

P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.

D A Wells answered Oct 10 '22 11:10