How to handle categorical variables in sklearn GradientBoostingClassifier?

I am attempting to train models with GradientBoostingClassifier using categorical variables.

The following is a minimal code sample that simply tries to pass a categorical variable to GradientBoostingClassifier.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
import pandas

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]

X_train = pandas.DataFrame(X_train)

# Insert fake categorical variable. 
# Just for testing in GradientBoostingClassifier.
X_train[0] = ['a']*40 + ['b']*40

# Model.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)

The following error appears:

ValueError: could not convert string to float: 'b'

From what I gather, categorical variables must be one-hot encoded before GradientBoostingClassifier can build the model.

Can GradientBoostingClassifier build models using categorical variables without having to do one hot encoding?

The R gbm package can handle the sample data above. I'm looking for a Python library with equivalent capability.

asked Jul 11 '14 21:07

user1045085

1 Answer

pandas.get_dummies or statsmodels.tools.tools.categorical (deprecated in newer statsmodels releases) can be used to convert a categorical variable into a dummy (one-hot) matrix. The dummy matrix can then be concatenated with the training data.
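
As a minimal sketch of that conversion, assuming a small made-up dataset (the arrays and the column name "cat" below are hypothetical, not taken from the question), pandas.get_dummies produces one column per category level, which can then be stacked next to the numeric features:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns plus one categorical column.
numeric = np.array([[5.1, 3.5], [4.9, 3.0], [7.0, 3.2]])
category = pd.Series(["a", "b", "a"], name="cat")

# One column per category level; cast to float so it stacks with the numeric data.
dummies = pd.get_dummies(category).to_numpy(dtype=float)

# Append the dummy columns to the numeric features.
features = np.concatenate((numeric, dummies), axis=1)
print(features.shape)
```

The explicit cast to float matters because, depending on the pandas version, get_dummies may return boolean or integer columns.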

Below is the example code from the question with the above procedure carried out.

from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve,auc
import numpy as np
import pandas as pd

iris = datasets.load_iris()
# Use only data for 2 classes.
X = iris.data[(iris.target==0) | (iris.target==1)]
Y = iris.target[(iris.target==0) | (iris.target==1)]

# Class 0 has indices 0-49. Class 1 has indices 50-99.
# Divide data into 80% training, 20% testing.
train_indices = list(range(40)) + list(range(50,90))
test_indices = list(range(40,50)) + list(range(90,100))
X_train = X[train_indices]
X_test = X[test_indices]
y_train = Y[train_indices]
y_test = Y[test_indices]


###########################################################################
###### Convert categorical variable to matrix and merge back with training
###### data.

# Fake categorical variable.
# pandas.get_dummies is used here because statsmodels' categorical()
# is deprecated in newer statsmodels releases.
catVar = np.array(['a']*40 + ['b']*40)
catVar = pd.get_dummies(catVar).to_numpy(dtype=float)   # One column per level.
X_train = np.concatenate((X_train, catVar), axis=1)

# Encode the test split the same way; both splits contain the levels 'a' and 'b'.
catVar = np.array(['a']*10 + ['b']*10)
catVar = pd.get_dummies(catVar).to_numpy(dtype=float)
X_test = np.concatenate((X_test, catVar), axis=1)
###########################################################################

# Model and test.
clf = GradientBoostingClassifier(learning_rate=0.01,max_depth=8,n_estimators=50).fit(X_train, y_train)

prob = clf.predict_proba(X_test)[:,1]   # Only look at P(y==1).

fpr, tpr, thresholds = roc_curve(y_test, prob)
roc_auc_prob = auc(fpr, tpr)

print(prob)
print(y_test)
print(roc_auc_prob)
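
The code above encodes the training and test splits separately, which only works because both splits happen to contain the same category levels. A hedged sketch of one way to avoid that dependency is scikit-learn's OneHotEncoder, fitted on the training data only; the arrays below are hypothetical, and the test split deliberately contains a level "c" that was never seen in training:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical columns; the test split contains an unseen level "c".
cat_train = np.array(["a", "b", "a", "b"]).reshape(-1, 1)
cat_test = np.array(["b", "c"]).reshape(-1, 1)

# Learn the category levels from the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(cat_train)

# Both splits get the same columns; the unseen level becomes an all-zero row.
train_dummies = encoder.transform(cat_train).toarray()
test_dummies = encoder.transform(cat_test).toarray()
print(test_dummies)
```

With handle_unknown="ignore", an unseen level does not raise an error at prediction time; whether an all-zero row is an acceptable encoding depends on the application.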

Thanks to Andreas Muller for pointing out that a pandas DataFrame should not be passed to scikit-learn estimators.

answered Oct 07 '22 15:10

user1045085

Tags: python, machine-learning, scikit-learn, decision-tree, ensemble-learning
