
Regression trees or Random Forest regressor with categorical inputs

I have been trying to use categorical inputs in a regression tree (or Random Forest regressor), but sklearn keeps raising errors and asking for numerical inputs.

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

MODEL = RandomForestRegressor(n_estimators=100)
MODEL.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4])  # does not work
MODEL.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4])          # works

MODEL = DecisionTreeRegressor()
MODEL.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4])  # does not work
MODEL.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4])          # works

To my understanding, categorical inputs should be usable in these methods without any conversion (e.g. WOE substitution).

Has anyone else had this difficulty?

thanks!

asked Nov 20 '13 by jpsfer

2 Answers

scikit-learn has no dedicated representation for categorical variables (a.k.a. factors in R). One possible workaround is to encode the strings as integers using LabelEncoder:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# transform the 1st column to integer labels, then cast the
# whole (string) array to float so sklearn accepts it
X[:, 0] = LabelEncoder().fit_transform(X[:, 0])
X = X.astype(float)

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))

Output:

[[ 0.  1.  2.]
 [ 1.  2.  3.]
 [ 0.  3.  2.]
 [ 2.  1.  3.]]
[ 1.61333333  2.13666667  2.53333333  2.95333333]

But remember that this is a slight hack if a and b are independent categories: the integer encoding imposes an ordering that does not exist (b is not really "bigger" than a), so it is only acceptable with tree-based estimators. The correct way is to one-hot encode the column, either with OneHotEncoder after the LabelEncoder or with pd.get_dummies, yielding one separate binary column per category of X[:, 0]:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# one-hot encode the 1st column: one 0/1 column per category
X_0 = pd.get_dummies(X[:, 0]).values
X = np.column_stack([X_0, X[:, 1:]]).astype(float)

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))
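For completeness, the same one-hot step can also be done inside scikit-learn itself with OneHotEncoder, which in recent versions accepts string columns directly. A minimal sketch (the data here mirrors the example above; `.toarray()` densifies the sparse output it returns by default):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.asarray([('a', 1, 2), ('b', 2, 3), ('a', 3, 2), ('c', 1, 3)])

# encode only the first (categorical) column; keep the numeric ones as-is
enc = OneHotEncoder()
X_0 = enc.fit_transform(X[:, [0]]).toarray()
X_num = np.column_stack([X_0, X[:, 1:].astype(float)])

print(X_num.shape)  # (4, 5): 3 one-hot columns + 2 numeric columns
```

The encoder remembers the category-to-column mapping, so the same `enc.transform(...)` can be applied consistently to new data at prediction time.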
answered Sep 27 '22 by Matt

You must dummy-code by hand in Python. I would suggest using pandas.get_dummies() for one-hot encoding. For boosted trees, I have had success using pandas.factorize() to achieve ordinal encoding.
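A minimal sketch of both approaches (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'color': ['a', 'b', 'a', 'c'], 'x1': [1, 2, 3, 1]})

# one-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df['color'], prefix='color')

# ordinal encoding: map each category to an integer code,
# in order of first appearance
codes, uniques = pd.factorize(df['color'])
df['color_code'] = codes

print(one_hot.columns.tolist())  # ['color_a', 'color_b', 'color_c']
print(list(codes))               # [0, 1, 0, 2]
```

Ordinal encoding keeps the feature count down (one column instead of one per category), which is why it tends to pair well with tree ensembles that can split the integer codes repeatedly.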

There is also a whole package for this sort of thing here.

For a more detailed explanation, see this Data Science Stack Exchange post.

answered Sep 27 '22 by Keith