I have a simple sklearn transformer class that I would like to use as part of an sklearn pipeline. The class takes a pandas DataFrame X_DF and a categorical column name, and calls pd.get_dummies to return the DataFrame with that column turned into a matrix of dummy variables:
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class dummy_var_encoder(TransformerMixin, BaseEstimator):
    '''Convert selected categorical column to (set of) dummy variables
    '''

    def __init__(self, column_to_dummy='default_col_name'):
        self.column = column_to_dummy
        print(self.column)

    def fit(self, X_DF, y=None):
        return self

    def transform(self, X_DF):
        ''' Update X_DF to have set of dummy-variables instead of orig column'''
        # convert self-attribute to local var for ease of stepping through function
        column = self.column
        # add columns for new dummy vars, and drop original categorical column
        dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)
        new_DF = pd.concat([X_DF[column], dummy_matrix], axis=1)
        return new_DF
Using this transformer on its own to fit/transform, I get the expected output. For some toy data as below:
from sklearn import datasets
# Load toy data
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)
y = pd.Series(iris.target, name='y')
# Create Arbitrary categorical features
X['category_1'] = pd.cut(X['sepal length (cm)'],
                         bins=3,
                         labels=['small', 'medium', 'large'])
X['category_2'] = pd.cut(X['sepal width (cm)'],
                         bins=3,
                         labels=['small', 'medium', 'large'])
...my dummy encoder produces the correct output:
encoder = dummy_var_encoder(column_to_dummy = 'category_1')
encoder.fit(X)
encoder.transform(X).iloc[15:21,:]
   category_1  category_1_small  category_1_medium  category_1_large
15     medium                 0                  1                 0
16      small                 1                  0                 0
17      small                 1                  0                 0
18     medium                 0                  1                 0
19      small                 1                  0                 0
20      small                 1                  0                 0
However, when I call the same transformer from an sklearn pipeline as defined below:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV
# Define Pipeline
clf = LogisticRegression(penalty='l1')
pipeline_steps = [('dummy_vars', dummy_var_encoder()),
                  ('clf', clf)
                  ]
pipeline = Pipeline(pipeline_steps)
# Define hyperparams try for dummy-encoder and classifier
# Fit 4 models - try dummying category_1 vs category_2, and using l1 vs l2 penalty in log-reg
param_grid = {'dummy_vars__column_to_dummy': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']
              }
# Define full model search process
cv_model_search = GridSearchCV(pipeline,
                               param_grid,
                               scoring='accuracy',
                               cv=KFold(),
                               refit=True,
                               verbose=3)
All's well until I fit the pipeline, at which point I get an error from the dummy encoder:
cv_model_search.fit(X,y=y)
In [101]: cv_model_search.fit(X,y=y)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
None
None
None
None
[CV] dummy_vars__column_to_dummy=category_1, clf__penalty=l1 .........
Traceback (most recent call last):
  File "", line 1, in <module>
    cv_model_search.fit(X,y=y)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 638, in fit
    cv.split(X, y, groups)))
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 257, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 222, in _fit
    **fit_params_steps[name])
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/base.py", line 521, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "", line 21, in transform
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1964, in __getitem__
    return self._getitem_column(key)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/internals.py", line 3599, in get
    raise ValueError("cannot label index with a null key")
ValueError: cannot label index with a null key
The trace is telling you exactly what went wrong. Learning to read a traceback is invaluable, especially when you are inheriting from libraries you might not fully understand.
Now, I have done a fair bit of inheriting in sklearn myself, and I can tell you that GridSearchCV is going to give you some trouble if the data passed into your fit or fit_transform methods is not a NumPy array. As Vivek mentioned in his comment, the X passed to your fit method is no longer a DataFrame. But let's take a look at the trace first.
ValueError: cannot label index with a null key
While Vivek is correct about the NumPy array, you have another problem here. The actual error you get is that the value of column in your transform method is None. If you were to look at the encoder object inside the pipeline, you would see that its __repr__ method outputs the following:
dummy_var_encoder(column_to_dummy=None)
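When using Pipeline, this parameter is what gets initialized and passed along to GridSearchCV. sklearn reads and sets an estimator's parameters by the names in its __init__ signature, so the search sets an attribute called column_to_dummy, while transform reads self.column, which is never updated. The same behavior can be seen throughout the cross-validation and search methods, and having attributes whose names differ from the input parameters causes issues like this. Fixing this will start you down the right path.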
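The mechanism can be imitated without sklearn at all. The sketch below is a toy stand-in, not sklearn's actual code: ToyEstimator and toy_clone are hypothetical names, and the lenient lookup that falls back to None is an assumption chosen to mirror the None values printed in the question's output.

```python
import inspect


class ToyEstimator(object):
    """Toy stand-in for sklearn-style parameter handling (hypothetical)."""

    def get_params(self):
        # Parameter names are taken from the __init__ signature...
        init = type(self).__init__
        names = [n for n in inspect.signature(init).parameters if n != "self"]
        # ...and each value is read from an attribute of the SAME name.
        # Assumption: a missing attribute quietly yields None.
        return {name: getattr(self, name, None) for name in names}

    def set_params(self, **params):
        # Each parameter is written to an attribute of the same name.
        for name, value in params.items():
            setattr(self, name, value)
        return self


def toy_clone(estimator):
    """Rebuild the estimator from its reported parameters."""
    return type(estimator)(**estimator.get_params())


class MismatchedEncoder(ToyEstimator):
    def __init__(self, column_to_dummy="default_col_name"):
        self.column = column_to_dummy  # attribute name differs from parameter


class MatchedEncoder(ToyEstimator):
    def __init__(self, column="default_col_name"):
        self.column = column  # attribute name matches parameter


broken = toy_clone(MismatchedEncoder())          # rebuilt with column_to_dummy=None
broken.set_params(column_to_dummy="category_1")  # sets broken.column_to_dummy only
print(broken.column)                             # None

fixed = toy_clone(MatchedEncoder())
fixed.set_params(column="category_1")
print(fixed.column)                              # category_1
```

With the mismatched names, the value chosen by the search lands on an attribute that transform never reads, which is how a None ends up as the column key.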
Modifying the __init__ method as such will solve this specific issue:
def __init__(self, column='default_col_name'):
    self.column = column
    print(self.column)
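One way to check that an estimator follows this naming convention is to round-trip it through get_params, set_params, and clone. The snippet below is a minimal sketch: DummyVarEncoder is a renamed copy of the class for illustration, and its transform is a placeholder that is not the focus here.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class DummyVarEncoder(TransformerMixin, BaseEstimator):
    """Illustrative copy of the encoder with matching parameter/attribute names."""

    def __init__(self, column="default_col_name"):
        self.column = column

    def fit(self, X_DF, y=None):
        return self

    def transform(self, X_DF):
        # Placeholder only; the real transform is discussed further down.
        return X_DF


encoder = DummyVarEncoder(column="category_1")

# get_params reports the value stored on the attribute named like the parameter.
params = encoder.get_params()
print(params)

# clone rebuilds the estimator from those parameters, so the value survives.
cloned = clone(encoder)
print(cloned.column)

# set_params is what a parameter search calls with the grid values.
encoder.set_params(column="category_2")
print(encoder.column)
```

If any of these steps loses the value, the parameter name and the attribute name have drifted apart.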
However, once you have done this, the issue Vivek mentioned will rear its head and you will have to deal with that. This is something I have run into before, though not with DataFrames specifically. I came up with a solution in "Use sklearn GridSearchCV on custom class whose fit method takes 3 arguments". Basically, I created a wrapper that implements the __getitem__ method so that the data looks and behaves in a way that passes the validation methods used in GridSearchCV, Pipeline, and other cross-validation methods.
I made these changes, and it looks like your problem then comes from the validation method check_array. While calling this method with dtype=pd.DataFrame would work, the linear model calls it with dtype=np.float64, which throws an error. To get around this, instead of concatenating the original data with your dummies, you could just return your dummy columns and fit using those. This is something that should be done anyway, since you wouldn't want to include both the dummy columns and the original data in the model you are trying to fit. You may also consider the drop_first option, but I'm getting off subject. So, changing your transform method like so allows the whole process to work as expected.
def transform(self, X_DF):
    ''' Update X_DF to have set of dummy-variables instead of orig column'''
    # convert self-attribute to local var for ease of stepping through function
    column = self.column
    # add columns for new dummy vars, and drop original categorical column
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)
    return dummy_matrix
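Putting both fixes together, a full search might look like the sketch below. It rests on several assumptions that are not from the question: a scikit-learn version that hands DataFrame slices to the pipeline steps, a shuffled KFold (iris is sorted by class), a grid over the classifier's C rather than penalty to sidestep solver-specific constraints, and an explicit cast of the dummies to float. DummyVarEncoder is again an illustrative rename.

```python
import pandas as pd
from sklearn import datasets
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline


class DummyVarEncoder(TransformerMixin, BaseEstimator):
    """Return only the dummy columns for one categorical column."""

    def __init__(self, column="category_1"):
        self.column = column  # same name as the __init__ parameter

    def fit(self, X_DF, y=None):
        return self

    def transform(self, X_DF):
        dummies = pd.get_dummies(X_DF[self.column], prefix=self.column)
        # Cast to float so the classifier's input validation gets numeric data.
        return dummies.astype(float)


# Toy data, built the same way as in the question.
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="y")
for source, name in [("sepal length (cm)", "category_1"),
                     ("sepal width (cm)", "category_2")]:
    X[name] = pd.cut(X[source], bins=3, labels=["small", "medium", "large"])

pipeline = Pipeline([("dummy_vars", DummyVarEncoder()),
                     ("clf", LogisticRegression())])

# The grid key uses the parameter name `column`, matching the attribute.
param_grid = {"dummy_vars__column": ["category_1", "category_2"],
              "clf__C": [0.1, 1.0]}

search = GridSearchCV(pipeline,
                      param_grid,
                      scoring="accuracy",
                      cv=KFold(n_splits=3, shuffle=True, random_state=0),
                      refit=True)
search.fit(X, y)
print(search.best_params_)
```

Because pd.cut produces a categorical column, get_dummies emits one column per label even when a fold is missing a category, so the train and test folds keep the same width.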