I am trying to add a calibration step to a sklearn pipeline so that the classifier outputs more trustworthy probabilities.
So far I have clumsily tried to insert a 'calibration' step using CalibratedClassifierCV, along the lines of (silly example for reproducibility):
import sklearn.datasets
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.calibration import CalibratedClassifierCV
data = sklearn.datasets.fetch_20newsgroups(categories=['alt.atheism', 'sci.space'])
df = pd.DataFrame(data = np.c_[data['data'], data['target']])\
    .rename({0:'text', 1:'class'}, axis = 'columns')
my_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', SGDClassifier(loss='modified_huber')),
    ('calibrator', CalibratedClassifierCV(cv=5, method='isotonic'))
])
my_pipeline.fit(df['text'].values, df['class'].values)
but that doesn't work (at least not in this way). Does anyone have tips about how to properly do this?
The SGDClassifier object should go into the CalibratedClassifierCV's base_estimator argument. (That argument was renamed to estimator in scikit-learn 1.2, and the old name was removed in 1.4.)
Your code should probably look something like this:
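my_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    # The wrapped estimator is passed positionally because its keyword is
    # base_estimator before scikit-learn 1.2 and estimator from 1.2 onwards.
    ('classifier', CalibratedClassifierCV(SGDClassifier(loss='modified_huber'), cv=5, method='isotonic'))
])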
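CalibratedClassifierCV is a meta-estimator: it wraps the classifier instead of following it as a separate step. Every step of a Pipeline except the last must be a transformer, so placing SGDClassifier in the middle of the pipeline fails.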
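To check that the wrapped classifier really yields probabilities, here is a minimal self-contained sketch. It assumes a small synthetic corpus (the word lists and the make_doc helper are made up for illustration) in place of the 20 newsgroups download, and it passes the SGDClassifier positionally so that it runs regardless of whether your scikit-learn version calls the keyword base_estimator or estimator.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the newsgroups data.
rng = np.random.default_rng(0)
space_words = ["orbit", "rocket", "launch", "moon", "satellite", "nasa"]
religion_words = ["belief", "faith", "god", "atheism", "religion", "church"]


def make_doc(words):
    """Build a five-word document by sampling from a word list."""
    return " ".join(rng.choice(words, size=5))


texts = [make_doc(space_words) for _ in range(30)] + \
        [make_doc(religion_words) for _ in range(30)]
labels = np.array([1] * 30 + [0] * 30)  # 1 = space, 0 = religion

my_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    # The calibrator wraps the classifier and is itself the final step.
    ("classifier", CalibratedClassifierCV(
        SGDClassifier(loss="modified_huber", random_state=0),
        cv=5,
        method="isotonic",
    )),
])
my_pipeline.fit(texts, labels)

# Columns follow my_pipeline.classes_, here [0, 1].
proba = my_pipeline.predict_proba(["rocket launch to the moon",
                                   "faith and religion"])
print(proba)
```

Each row of proba sums to 1, and the calibration folds are fitted on the vectorized text because the vectorizer runs before the calibrator inside the pipeline. On real data you would probably want to compare method='isotonic' with method='sigmoid', since isotonic calibration tends to need more samples.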