I'm working on a prediction problem and building a decision tree in R. I have several categorical variables and I'd like to one-hot encode them consistently across my training and testing sets. I managed to do it on my training data with:

temps <- X_train
tt <- subset(temps, select = -output)
oh <- data.frame(model.matrix(~ . - 1, tt), CLASS = temps$output)
But I can't find a way to apply the same encoding on my testing set, how can I do that?
When working with categorical variables, you can use the group_by() function to divide the data into subgroups based on the variable's distinct categories. You can group by a single variable, or pass multiple variable names to group by several variables.
Target encoding is also very simple: each value of a categorical variable is encoded as the mean of the target variable over the rows that share that value. The per-category mean can be computed with the aggregate() function in R. Some implementations also let you add noise to the encoded values via a sigma argument.
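As a minimal sketch of the idea described above (the data frame and column names here are made up for illustration), the per-category mean can be computed with aggregate() and merged back onto the rows:

```r
# Toy data: categorical predictor `cat` and numeric target `y`
df <- data.frame(cat = c("a", "a", "b", "b", "b"),
                 y   = c(1, 0, 1, 1, 0))

# Mean of the target per category, via aggregate()
enc <- aggregate(y ~ cat, data = df, FUN = mean)
names(enc)[2] <- "cat_encoded"

# Attach the encoded value to every row of the original data
df <- merge(df, enc, by = "cat")
df
# Optionally, noise could be added here, e.g. rnorm(nrow(df), sd = sigma)
```

Note that for a train/test split, the means should be computed on the training rows only and then merged onto both sets.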
To convert categorical variables to dummy variables in the tidyverse, use the spread() function with three arguments: key, which is the column to convert into categorical values, in this case "Reporting Airline"; value, which is the value you want to set the key to (in this case "dummy"); and fill, which fills the combinations that do not occur (typically with 0).
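A sketch of that recipe on a toy stand-in for the airline data (the column name and values here are assumptions; note also that spread() has since been superseded by pivot_wider() in tidyr):

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the flight data described above
flights <- data.frame(id = 1:4,
                      Reporting_Airline = c("AA", "UA", "AA", "DL"))

# Add a constant `dummy` column, then spread it out by airline:
# one 0/1 column per distinct airline, 0 where the combination is absent
encoded <- flights %>%
  mutate(dummy = 1) %>%
  spread(key = Reporting_Airline, value = dummy, fill = 0)
encoded
```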
This means that if your data contains categorical variables, you must encode them as numbers before you can fit and evaluate a model. The two most popular techniques are integer encoding and one-hot encoding, although a newer technique, learned embeddings, may provide a useful middle ground between the two.
R has “one-hot” encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water: they can’t point to it, as it is everywhere. For example, we can see evidence of one-hot encoding in the variable names chosen by a linear regression:
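A small illustration of the point above, using the built-in iris data: fit a linear model on a factor and the dummy columns R created show up in the coefficient names.

```r
# Fit a linear model with a factor predictor; R expands the factor
# into dummy columns behind the scenes, visible in the coefficient names
fit <- lm(Sepal.Length ~ Species, data = iris)
names(coef(fit))
# "(Intercept)" "Speciesversicolor" "Speciesvirginica"
```

The reference level (setosa) is absorbed into the intercept, which is why only two of the three species appear.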
So if you have 27 distinct values of a categorical variable, then 5 columns are sufficient to encode this variable - as 5-digit binary numbers can store any value from 0 to 31. An implementation is provided below using the binaryLogic package.
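The text above mentions the binaryLogic package; here is a base-R sketch of the same idea that avoids the dependency (the function name binary_encode is my own). Each level's 0-based integer code is written out as 5 bits, one column per bit:

```r
# Binary-encode a categorical vector: n_bits columns of 0/1,
# enough for up to 2^n_bits distinct levels (5 bits -> 32 levels)
binary_encode <- function(x, n_bits = 5) {
  codes <- as.integer(factor(x)) - 1L        # 0-based integer codes
  bits <- sapply(seq_len(n_bits) - 1L,
                 function(b) bitwAnd(bitwShiftR(codes, b), 1L))
  colnames(bits) <- paste0("bit", seq_len(n_bits) - 1L)
  bits
}

binary_encode(c("a", "b", "c"))
# 3 rows x 5 columns; "b" (code 1) sets bit0, "c" (code 2) sets bit1
```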
One-hot encoding can also be applied on top of an integer encoding: the integer-encoded variable is removed and a new binary variable is added for each unique integer value. For instance, a variable of colors with three distinct values becomes three binary columns.
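A quick sketch of that with a small colors vector (the data here is made up):

```r
# Three distinct colors -> three 0/1 columns, exactly one 1 per row
colors <- factor(c("red", "green", "blue", "green"))
model.matrix(~ 0 + colors)
```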
I recommend using the dummyVars function in the caret package:
library(caret)

customers <- data.frame(
  id      = c(10, 20, 30, 40, 50),
  gender  = c('male', 'female', 'female', 'male', 'female'),
  mood    = c('happy', 'sad', 'happy', 'sad', 'happy'),
  outcome = c(1, 1, 0, 0, 0))
customers

  id gender  mood outcome
1 10   male happy       1
2 20 female   sad       1
3 30 female happy       0
4 40   male   sad       0
5 50 female happy       0

# dummify the data
dmy <- dummyVars(" ~ .", data = customers)
trsf <- data.frame(predict(dmy, newdata = customers))
trsf

  id gender.female gender.male mood.happy mood.sad outcome
1 10             0           1          1        0       1
2 20             1           0          0        1       1
3 30             1           0          1        0       0
4 40             0           1          0        1       0
5 50             1           0          1        0       0
You apply the same procedure to both the training and validation sets.
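Concretely, the dummyVars object is fitted once on the training data and then reused for both splits, so the column set and order are guaranteed to match. A sketch using iris as a stand-in for the asker's X_train / X_test:

```r
library(caret)

# Split iris into train and test sets
set.seed(1)
idx   <- sample(seq_len(nrow(iris)), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit the encoder on the training data only...
dmy <- dummyVars(~ ., data = train)

# ...then apply the SAME encoder to both splits
train_oh <- data.frame(predict(dmy, newdata = train))
test_oh  <- data.frame(predict(dmy, newdata = test))
```

Both frames now have identical column names, e.g. Species.setosa, Species.versicolor, Species.virginica.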
Here's a simple solution to one-hot encode your categorical variable using no packages.
model.matrix(~0+category)
It needs your categorical variable to be a factor. The factor levels must be the same in your training and test data; check with levels(train$category) and levels(test$category). It doesn't matter if some levels don't occur in your test set.
Here's an example using the iris dataset.
data(iris)

# Split into train and test sets.
train <- sample(1:nrow(iris), 100)
test <- -1 * train
iris[test,]

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
34           5.5         4.2          1.4         0.2    setosa
106          7.6         3.0          6.6         2.1 virginica
112          6.4         2.7          5.3         1.9 virginica
127          6.2         2.8          4.8         1.8 virginica
132          7.9         3.8          6.4         2.0 virginica
model.matrix() creates a column for each level of the factor, even if it is not present in the data. Zero indicates it is not that level; one indicates it is. Adding the zero specifies that you do not want an intercept or reference level, and is equivalent to -1.
oh_train <- model.matrix(~0 + iris[train, 'Species'])
oh_test <- model.matrix(~0 + iris[test, 'Species'])

# Renaming the columns to be more concise.
attr(oh_test, "dimnames")[[2]] <- levels(iris$Species)

  setosa versicolor virginica
1      1          0         0
2      0          0         1
3      0          0         1
4      0          0         1
5      0          0         1
P.S. It's generally preferable to include all categories in training and test data. But that's none of my business.