
Neural network with categorical variables (enum) as inputs

I'm trying to solve some machine-learning problems using neural networks, mostly with NEAT (NeuroEvolution of Augmenting Topologies).

Some of my input variables are continuous, but some of them are of a categorical nature, like:

  • Species: {Lion,Leopard,Tiger,Jaguar}
  • Branches of Trade: {Health care,Insurances,Finance,IT,Advertising}

At first I wanted to model such a variable by mapping the categories to discrete numbers, like:

{Lion:1, Leopard:2, Tiger:3, Jaguar:4}

But I'm afraid this adds some kind of arbitrary topology on the variable. A Tiger is not the sum of a Lion and a Leopard.

What approaches to this problem are usually employed?

cheesus asked Sep 16 '13 09:09




1 Answer

Unfortunately there is no good solution; each option leads to some kind of problem:

  • Your solution adds a topology, as you mentioned. It may not be that bad, since a NN can fit arbitrary functions and represent "ifs", but in many cases it will be (NNs often get stuck in local minima).
  • You can encode your data in the form is_categorical_feature_i_equal_j, which won't induce any additional topology but will grow the number of features (one new feature per possible category value). So instead of "species" you get the features "is_lion", "is_leopard", etc., and only one of them is equal to 1 at a time.
  • If you have a large amount of data compared to the number of possible categorical values (for example, 10000 data points and only 10 possible categorical values), you can also split the problem into 10 independent ones, each trained on one particular value (so we have a "neural network for lions", a "neural network for jaguars", etc.).
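
The second option can be sketched in plain Python as follows. This is only an illustration: the helper name `one_hot`, the category order, and the sample weight value are assumptions, not something from the question.

```python
SPECIES = ["Lion", "Leopard", "Tiger", "Jaguar"]

def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]

# One continuous input (a hypothetical weight in kg) plus the encoded species.
weight_kg = 190.0
features = [weight_kg] + one_hot("Tiger", SPECIES)
print(features)  # [190.0, 0.0, 0.0, 1.0, 0.0]
```

The network then gets one input node per category value instead of a single "species" node.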

The first two approaches are the two "extreme" cases: the first is computationally very cheap but can lead to high bias, while the second introduces much more complexity but should not influence the classification process itself. The last one is rarely usable (it assumes a small number of categorical values), yet it is quite reasonable in terms of machine learning.
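
One way to see where the bias of the integer mapping comes from is to compare distances between encoded values. The sketch below uses the mapping from the question; the helper names `one_hot` and `distance` are made up for illustration.

```python
import math

integer_code = {"Lion": 1, "Leopard": 2, "Tiger": 3, "Jaguar": 4}
species = list(integer_code)

def one_hot(name):
    """Encode a species as a 0/1 indicator vector."""
    return [1.0 if name == s else 0.0 for s in species]

def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for other in ["Leopard", "Jaguar"]:
    d_int = abs(integer_code["Lion"] - integer_code[other])
    d_hot = distance(one_hot("Lion"), one_hot(other))
    print(other, d_int, round(d_hot, 3))
# Leopard 1 1.414
# Jaguar 3 1.414
```

With integer codes, a Jaguar ends up three times as far from a Lion as a Leopard does, purely because of the chosen numbering. With indicator features, every pair of species is equally far apart.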

Update

So many things have changed in 8 years. Solution 2 is definitely the most popular one, and with the growth of compute, the wide adoption of neural networks, and support for sparse inputs, the cost is now negligible.
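
A rough illustration of why sparse inputs keep the cost low: multiplying an indicator vector by a first-layer weight matrix just selects one row of that matrix, so only the index of the active category is needed. The weights and function names below are hypothetical, chosen only for this sketch.

```python
SPECIES = ["Lion", "Leopard", "Tiger", "Jaguar"]

# Hypothetical first-layer weights: one row per category, three hidden units.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def dense_forward(one_hot_vec, weights):
    """Full vector-matrix product that touches every category row."""
    n_hidden = len(weights[0])
    return [
        sum(x * row[j] for x, row in zip(one_hot_vec, weights))
        for j in range(n_hidden)
    ]

def sparse_forward(index, weights):
    """Same result, computed by picking the row of the active category."""
    return list(weights[index])

idx = SPECIES.index("Tiger")
one_hot_vec = [1.0 if i == idx else 0.0 for i in range(len(SPECIES))]
print(dense_forward(one_hot_vec, W) == sparse_forward(idx, W))  # True
```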

lejlot answered Sep 20 '22 14:09