I'm trying to solve some machine-learning problems using neural networks, mostly with NEAT (NeuroEvolution of Augmenting Topologies).
Some of my input variables are continuous, but some of them are of a categorical nature, like the species of an animal: Lion, Leopard, Tiger, or Jaguar.
At first I wanted to model such a variable by mapping the categories to discrete numbers, like:
{Lion:1, Leopard:2, Tiger:3, Jaguar:4}
But I'm afraid this imposes some kind of arbitrary topology on the variable: a Tiger is not the sum of a Lion and a Leopard.
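A quick illustration of the problem (the mapping above, in Python):

```python
# The integer codes assert an ordering and arithmetic that have no semantic meaning.
codes = {"Lion": 1, "Leopard": 2, "Tiger": 3, "Jaguar": 4}

print(codes["Lion"] + codes["Leopard"] == codes["Tiger"])  # True, i.e. 1 + 2 == 3
print(codes["Leopard"] < codes["Jaguar"])                  # True, but the comparison means nothing
```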
What approaches to this problem are usually employed?
Machine learning algorithms and deep learning neural networks require input and output variables to be numbers. This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model.
One option is to let the network learn the representation itself: we can use neural networks to represent categorical variables as learned embeddings.
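For instance, here is a minimal sketch using PyTorch's nn.Embedding (the layer sizes and variable names are illustrative, not from the original post):

```python
import torch
import torch.nn as nn

# Each category gets a dense, learnable vector instead of a one-hot column.
species = ["Lion", "Leopard", "Tiger", "Jaguar"]
index = {name: i for i, name in enumerate(species)}  # category -> integer id

embedding = nn.Embedding(num_embeddings=len(species), embedding_dim=2)  # dim chosen arbitrarily

ids = torch.tensor([index["Tiger"], index["Lion"]])  # a mini-batch of category ids
vectors = embedding(ids)  # shape (2, 2); trained jointly with the rest of the model
print(vectors.shape)
```

Because the embedding weights are updated by backpropagation, categories that behave similarly can end up with similar vectors.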
The most common alternative is one-hot encoding: for each level of a categorical feature, we create a new variable. Each category is mapped to a binary variable containing either 0 or 1, where 0 represents the absence and 1 the presence of that category. These newly created binary features are known as dummy variables.
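A minimal sketch with pandas (the prefix and column names are illustrative):

```python
import pandas as pd

# One binary "dummy" column per category level; exactly one is set per row.
species = pd.Series(["Lion", "Tiger", "Leopard", "Jaguar"], name="species")
dummies = pd.get_dummies(species, prefix="is")
print(dummies)  # columns: is_Jaguar, is_Leopard, is_Lion, is_Tiger
```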
Unfortunately there is no perfect solution; each approach leads to some kind of problem:

1. Map each category to an integer, as you proposed.
2. Create a binary feature is_categorical_feature_i_equal_j for every value j of every categorical feature i. This won't induce any additional topology, but it significantly grows the number of features. So instead of "species" you get features "is_lion", "is_leopard", etc., and only one of them equals 1 at a time (a hand-rolled sketch follows this list).

These first two approaches are two "extreme" cases: the first is very computationally cheap but can lead to high bias, while the second introduces much complexity but should not influence the classification process itself. The last one is rarely usable (due to its assumption of a small number of categorical values), yet it is quite reasonable in terms of machine learning.
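For approach 2, a hand-rolled version in plain Python (assuming a fixed category list; the names are illustrative), e.g. to build the input vector for a NEAT genome:

```python
CATEGORIES = ["Lion", "Leopard", "Tiger", "Jaguar"]

def one_hot(value, categories=CATEGORIES):
    """Return 0/1 features is_lion, is_leopard, ...: exactly one equals 1."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("Tiger"))  # [0, 0, 1, 0]
```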
So many things have changed in 8 years. Solution 2 is definitely the most popular one, and with the growth of compute, the wide adoption of neural networks, and support for sparse inputs, its cost is now negligible.
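As one example of the sparse-input point, scikit-learn's OneHotEncoder returns a SciPy sparse matrix by default, so even very wide one-hot features stay cheap to store:

```python
from sklearn.preprocessing import OneHotEncoder

X = [["Lion"], ["Tiger"], ["Leopard"], ["Tiger"]]
encoder = OneHotEncoder()            # sparse output by default
X_sparse = encoder.fit_transform(X)  # SciPy sparse matrix, shape (4, 3)
print(type(X_sparse), X_sparse.shape)
```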