I have been trying to use a categorical input in a regression tree (or RandomForestRegressor), but sklearn keeps returning errors and asking for numerical inputs.
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

model = RandomForestRegressor(n_estimators=100)
model.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4])  # does not work
model.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4])  # works

model = DecisionTreeRegressor()
model.fit([('a',1,2),('b',2,3),('a',3,2),('b',1,3)], [1,2.5,3,4])  # does not work
model.fit([(1,1,2),(2,2,3),(1,3,2),(2,1,3)], [1,2.5,3,4])  # works
To my understanding, categorical inputs should be possible in these methods without any conversion (e.g. WOE substitution).
Has anyone else had this difficulty?
Thanks!
scikit-learn has no dedicated representation for categorical variables (a.k.a. factors in R). One possible solution is to encode the strings as integers using LabelEncoder:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# Transform the 1st column to integer labels
X[:, 0] = LabelEncoder().fit_transform(X[:, 0])
X = X.astype(float)  # the array still holds strings until this cast

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))
Output:
[[ 0.  1.  2.]
 [ 1.  2.  3.]
 [ 0.  3.  2.]
 [ 2.  1.  3.]]
[ 1.61333333 2.13666667 2.53333333 2.95333333]
But remember that this is a slight hack if a and b are independent categories, and it only works with tree-based estimators. Why? Because b is not really bigger than a. The correct way would be to use OneHotEncoder after the LabelEncoder, or pd.get_dummies, yielding separate one-hot encoded columns (one per category) in place of X[:, 0]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a',1,2),('b',2,3),('a',3,2),('c',1,3)])
y = np.asarray([1,2.5,3,4])

# One-hot encode the 1st column: one 0/1 column per category
X_0 = pd.get_dummies(X[:, 0]).to_numpy(dtype=float)
X = np.column_stack([X_0, X[:, 1:].astype(float)])

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X, y)
print(X)
print(regressor.predict(X))
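As a variant of the above: with scikit-learn >= 0.20 (an assumption about your installed version), OneHotEncoder accepts string categories directly, so the LabelEncoder step mentioned earlier is no longer needed. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

X = np.asarray([('a', 1, 2), ('b', 2, 3), ('a', 3, 2), ('c', 1, 3)], dtype=object)
y = np.asarray([1, 2.5, 3, 4])

# Encode the string column directly; fit_transform returns a sparse
# matrix, so densify it with toarray()
encoder = OneHotEncoder()
onehot = encoder.fit_transform(X[:, [0]]).toarray()  # one column each for a, b, c

# Stack the one-hot columns next to the untouched numeric columns
X_enc = np.column_stack([onehot, X[:, 1:].astype(float)])
print(X_enc.shape)  # (4, 5): 3 one-hot columns + 2 numeric columns

regressor = RandomForestRegressor(n_estimators=150, min_samples_split=2)
regressor.fit(X_enc, y)
print(regressor.predict(X_enc))
```

This avoids the double LabelEncoder + OneHotEncoder pass that older scikit-learn versions required.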
You must dummy-code by hand in Python. I would suggest using pandas.get_dummies() for one-hot encoding. For boosted trees I have had success using pandas.factorize() to achieve ordinal encoding.
There is also a whole package for this sort of thing here.
For a more detailed explanation, look at this Data Science Stack Exchange post.
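To illustrate the factorize() approach from the answer above: pd.factorize maps each distinct category to an integer code in order of first appearance, which is exactly the kind of ordinal encoding tree-based models can consume. A small sketch:

```python
import pandas as pd

# pd.factorize returns (codes, uniques): integer codes per distinct
# category, assigned in order of first appearance
codes, uniques = pd.factorize(['a', 'b', 'a', 'c'])
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['a', 'b', 'c']
```

As with the LabelEncoder trick, this imposes an arbitrary order on the categories, so it is only safe with tree-based estimators.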