Custom Sklearn Transformer works alone, Throws Error When Used in Pipeline

I have a simple sklearn class I would like to use as part of an sklearn pipeline. This class just takes a pandas dataframe X_DF and a categorical column name, and calls pd.get_dummies to return the dataframe with the column turned into a matrix of dummy variables...

import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class dummy_var_encoder(TransformerMixin, BaseEstimator):
    '''Convert selected categorical column to (set of) dummy variables    
    '''


    def __init__(self, column_to_dummy='default_col_name'):
        self.column = column_to_dummy
        print(self.column)

    def fit(self, X_DF, y=None):
        return self 

    def transform(self, X_DF):
        ''' Update X_DF to have set of dummy-variables instead of orig column'''        

        # convert self-attribute to local var for ease of stepping through function
        column = self.column

        # build dummy columns for the categorical column, then return them
        # alongside the original column (note: all other columns are dropped)
        dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)

        new_DF = pd.concat([X_DF[column], dummy_matrix], axis=1)

        return new_DF

Now, using this transformer on its own to fit/transform, I get the output I expect. For some toy data as below:

from sklearn import datasets
# Load toy data 
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)
y = pd.Series(iris.target, name='y')

# Create Arbitrary categorical features
X['category_1'] = pd.cut(X['sepal length (cm)'], 
                         bins=3, 
                         labels=['small', 'medium', 'large'])

X['category_2'] = pd.cut(X['sepal width (cm)'], 
                         bins=3, 
                         labels=['small', 'medium', 'large'])

...my dummy encoder produces the correct output:

encoder = dummy_var_encoder(column_to_dummy = 'category_1')
encoder.fit(X)
encoder.transform(X).iloc[15:21,:]

category_1
   category_1  category_1_small  category_1_medium  category_1_large
15     medium                 0                  1                 0
16      small                 1                  0                 0
17      small                 1                  0                 0
18     medium                 0                  1                 0
19      small                 1                  0                 0
20      small                 1                  0                 0

However, when I call the same transformer from an sklearn pipeline as defined below:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV

# Define Pipeline
clf = LogisticRegression(penalty='l1')
pipeline_steps = [('dummy_vars', dummy_var_encoder()),
                  ('clf', clf)
                  ]

pipeline = Pipeline(pipeline_steps)

# Define hyperparams try for dummy-encoder and classifier
# Fit 4 models - try dummying category_1 vs category_2, and using l1 vs l2 penalty in log-reg
param_grid = {'dummy_vars__column_to_dummy': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']
                  }

# Define full model search process 
cv_model_search = GridSearchCV(pipeline, 
                               param_grid, 
                               scoring='accuracy', 
                               cv = KFold(),
                               refit=True,
                               verbose = 3) 

All's well until I fit the pipeline, at which point I get an error from the dummy encoder:

In [101]: cv_model_search.fit(X,y=y)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
None None None None
[CV] dummy_vars__column_to_dummy=category_1, clf__penalty=l1 .........
Traceback (most recent call last):
  File "", line 1, in <module>
    cv_model_search.fit(X,y=y)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 638, in fit
    cv.split(X, y, groups)))
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 257, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 222, in _fit
    **fit_params_steps[name])
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/base.py", line 521, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "", line 21, in transform
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1964, in __getitem__
    return self._getitem_column(key)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/internals.py", line 3599, in get
    raise ValueError("cannot label index with a null key")

ValueError: cannot label index with a null key

Max Power asked Oct 17 '17



1 Answer

The trace is telling you exactly what went wrong. Learning to diagnose the trace is invaluable, especially when you are inheriting from libraries you might not completely understand.

Now, I have done a fair bit of inheriting in sklearn myself, and I can tell you without a doubt that GridSearchCV will give you trouble if the data passed to your fit or fit_transform methods is not a NumPy array. As Vivek mentioned in his comment, the X getting passed to your fit method is no longer a DataFrame. But let's take a look at the trace first.

ValueError: cannot label index with a null key

While Vivek is correct about the NumPy array, you have another problem here. The actual error you get is that the value of column in your fit method is None. If you look at your encoder object above, you will see that its __repr__ method outputs the following:

dummy_var_encoder(column_to_dummy=None)

When you use Pipeline with GridSearchCV, this parameter is what gets read off the estimator and passed along to each clone. The reason it comes back as None is that BaseEstimator.get_params introspects the __init__ signature and reads each parameter back off the instance under the same name; your __init__ takes column_to_dummy but stores it as self.column, so get_params finds nothing for column_to_dummy, and the clone that GridSearchCV actually fits gets built with column_to_dummy=None. You will see this behavior throughout the cross-validation and search machinery, and giving an attribute a different name than its __init__ parameter causes exactly this kind of issue. Fixing this will start you down the right path.
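
You can see the broken round trip directly. On the sklearn 0.19 from your traceback, get_params falls back to None when the attribute lookup fails (newer releases raise an AttributeError here instead, which surfaces the mismatch immediately):

from sklearn.base import clone

enc = dummy_var_encoder(column_to_dummy='category_1')
print(enc.get_params())   # {'column_to_dummy': None} -- attribute lookup failed
print(clone(enc).column)  # None: this is the clone GridSearchCV actually fits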

Modifying the __init__ method so the parameter and the attribute share a name will solve this specific issue:

def __init__(self, column='default_col_name'):
    self.column = column
    print(self.column)

Note that if you rename the parameter to column as above, the grid key must change to 'dummy_vars__column' to match; the alternative is to keep the column_to_dummy name and store it as self.column_to_dummy.

However, once you have done this, the issue Vivek mentioned will rear its head, and you will have to deal with that. This is something I have run into before, though not with DataFrames specifically. I came up with a solution in Use sklearn GridSearchCV on custom class whose fit method takes 3 arguments. Basically, I created a wrapper that implements the __getitem__ method in a way that makes the data look and behave in a way that passes the validation used in GridSearchCV, Pipeline, and other cross-validation methods; a rough sketch of the idea follows.
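
This is only the shape of that idea, not the code from the linked answer; the class name and details here are hypothetical:

class SliceableData(object):
    '''Hypothetical sketch: expose __len__, shape and an index-array
    __getitem__ so sklearn's index-based fold slicing can cut the data,
    while the pandas object survives underneath.'''

    def __init__(self, df):
        self.df = df
        self.shape = df.shape  # lets sklearn treat it as array-like

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # CV splitters hand over arrays of row positions; return those rows
        return self.df.iloc[idx]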

Edit

I made these changes, and it looks like your problem then comes from the validation method check_array. While calling this method with dtype=pd.DataFrame would work, the linear model calls it with dtype=np.float64, which throws an error. To get around this, instead of concatenating the original data with your dummies you can just return the dummy columns and fit using those. This is something that should be done anyway, since you wouldn't want to include both the dummy columns and the original data in the model you are trying to fit. You may also consider the drop_first option, but I'm getting off subject. So, changing your transform method like so allows the whole process to work as expected.

def transform(self, X_DF):
    '''Replace the input with the set of dummy variables for the selected column'''

    # convert self-attribute to local var for ease of stepping through function
    column = self.column

    # build the dummy columns and return only those, dropping the original
    # categorical column (and everything else) from the fitted data
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)

    return dummy_matrix
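
Putting the two fixes together, a re-run along these lines should then complete. This is a sketch against the question's setup (same imports and data), with the grid key updated to match the renamed parameter:

# Rebuild the pipeline with the fixed encoder and search again
pipeline = Pipeline([('dummy_vars', dummy_var_encoder()),
                     ('clf', LogisticRegression(penalty='l1'))])

param_grid = {'dummy_vars__column': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']}

cv_model_search = GridSearchCV(pipeline, param_grid, scoring='accuracy',
                               cv=KFold(), refit=True, verbose=3)
cv_model_search.fit(X, y)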
Grr answered Sep 19 '22