Custom Sklearn Transformer works alone, Throws Error When Used in Pipeline

I have a simple sklearn class I would like to use as part of an sklearn pipeline. This class just takes a pandas dataframe X_DF and a categorical column name, and calls pd.get_dummies to return the dataframe with the column turned into a matrix of dummy variables...

import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class dummy_var_encoder(TransformerMixin, BaseEstimator):
    '''Convert selected categorical column to (set of) dummy variables    
    '''


    def __init__(self, column_to_dummy='default_col_name'):
        self.column = column_to_dummy
        print self.column

    def fit(self, X_DF, y=None):
        return self 

    def transform(self, X_DF):
        ''' Update X_DF to have set of dummy-variables instead of orig column'''        

        # convert self-attribute to local var for ease of stepping through function
        column = self.column

        # add columns for new dummy vars, and drop original categorical column
        dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)

        new_DF = pd.concat([X_DF[column], dummy_matrix], axis=1)

        return new_DF

Now, using this transformer on its own to fit/transform, I get the expected output. For some toy data as below:

from sklearn import datasets
# Load toy data 
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)
y = pd.Series(iris.target, name='y')

# Create Arbitrary categorical features
X['category_1'] = pd.cut(X['sepal length (cm)'], 
                         bins=3, 
                         labels=['small', 'medium', 'large'])

X['category_2'] = pd.cut(X['sepal width (cm)'], 
                         bins=3, 
                         labels=['small', 'medium', 'large'])

...my dummy encoder produces the correct output:

encoder = dummy_var_encoder(column_to_dummy = 'category_1')
encoder.fit(X)
encoder.transform(X).iloc[15:21,:]

category_1
   category_1  category_1_small  category_1_medium  category_1_large
15     medium                 0                  1                 0
16      small                 1                  0                 0
17      small                 1                  0                 0
18     medium                 0                  1                 0
19      small                 1                  0                 0
20      small                 1                  0                 0

However, when I call the same transformer from an sklearn pipeline as defined below:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV

# Define Pipeline
clf = LogisticRegression(penalty='l1')
pipeline_steps = [('dummy_vars', dummy_var_encoder()),
                  ('clf', clf)
                  ]

pipeline = Pipeline(pipeline_steps)

# Define hyperparams try for dummy-encoder and classifier
# Fit 4 models - try dummying category_1 vs category_2, and using l1 vs l2 penalty in log-reg
param_grid = {'dummy_vars__column_to_dummy': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']
                  }

# Define full model search process 
cv_model_search = GridSearchCV(pipeline, 
                               param_grid, 
                               scoring='accuracy', 
                               cv = KFold(),
                               refit=True,
                               verbose = 3)

All's well until I fit the pipeline, at which point I get an error from the dummy encoder:

cv_model_search.fit(X,y=y)

In [101]: cv_model_search.fit(X,y=y)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
None
None
None
None
[CV] dummy_vars__column_to_dummy=category_1, clf__penalty=l1 .........
Traceback (most recent call last):

  File "", line 1, in <module>
    cv_model_search.fit(X,y=y)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 638, in fit
    cv.split(X, y, groups)))

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 257, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 222, in _fit
    **fit_params_steps[name])

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/base.py", line 521, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)

  File "", line 21, in transform
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1964, in __getitem__
    return self._getitem_column(key)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)

  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/internals.py", line 3599, in get
    raise ValueError("cannot label index with a null key")

ValueError: cannot label index with a null key

asked Oct 17 '17 02:10

Max Power

1 Answer

The trace is telling you exactly what went wrong. Learning to read a trace is invaluable, especially when you are inheriting from libraries you might not understand completely.

Now, I have done a fair bit of inheriting in sklearn myself, and I can tell you that GridSearchCV is going to give you some trouble if the data passed into your fit or fit_transform methods is not a NumPy array. As Vivek mentioned in his comment, the X getting passed to your fit method is no longer a DataFrame. But let's take a look at the trace first.

ValueError: cannot label index with a null key

While Vivek is correct about the NumPy array, you have another problem here. The actual error you get is that the value of column in your transform method is None, so X_DF[column] is indexed with a null key. If you were to look at your encoder object above, you would see that the __repr__ method outputs the following:

dummy_var_encoder(column_to_dummy=None)

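GridSearchCV clones the pipeline for every candidate and fold. Cloning rebuilds each step from get_params(), which looks up an attribute with the same name as each __init__ argument, and set_params() then writes to that same name. Because your __init__ stores column_to_dummy under self.column, there is no column_to_dummy attribute to find, so the clone is built with column_to_dummy=None and the values from the grid never reach self.column. (Older scikit-learn releases, such as the one in your trace, return None for the missing attribute; newer ones raise an AttributeError instead.) The same convention applies throughout the cross-validation and search methods, and having attributes named differently from the input parameters causes issues like this. Fixing this will start you down the right path.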

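Modifying the __init__ method so that the attribute name matches the parameter name will solve this specific issue. Note that the key in param_grid then has to change from dummy_vars__column_to_dummy to dummy_vars__column: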

def __init__(self, column='default_col_name'):
    self.column = column
    print(self.column)
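To see why the names have to match, here is a minimal sketch rather than a definitive account of sklearn internals. MismatchedEncoder, MatchedEncoder and params_or_error are hypothetical names made up for this illustration; only BaseEstimator, TransformerMixin, clone, get_params and set_params come from scikit-learn. What get_params does with the mismatched class depends on the installed version, so the sketch catches the error instead of assuming one behavior.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


class MismatchedEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical class: the parameter is stored under a different name."""

    def __init__(self, column_to_dummy=None):
        self.column = column_to_dummy  # attribute name != parameter name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class MatchedEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical class: the attribute name equals the parameter name."""

    def __init__(self, column=None):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


def params_or_error(estimator):
    """Return get_params(), or a message if this sklearn version raises."""
    try:
        return estimator.get_params()
    except AttributeError as exc:
        return "get_params failed: {}".format(exc)


# Older releases report {'column_to_dummy': None} here (possibly with a
# warning); newer releases raise AttributeError, which is caught above.
mismatched_result = params_or_error(MismatchedEncoder("category_1"))
print(mismatched_result)

# With matching names the value survives clone() and set_params().
good = MatchedEncoder(column="category_1")
good_clone = clone(good)
print(good_clone.column)
good_clone.set_params(column="category_2")
print(good_clone.column)
```

Either way, the value given to the mismatched class never comes back out of get_params, which is what the search relies on when it copies and reconfigures the pipeline.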

However, once you have done this, the issue Vivek mentioned will rear its head and you will have to deal with that. This is something I have run into before, though not with DataFrames specifically. I came up with a solution in "Use sklearn GridSearchCV on custom class whose fit method takes 3 arguments". Basically, I created a wrapper that implements the __getitem__ method so that the data looks and behaves in a way that passes the validation methods used in GridSearchCV, Pipeline, and other cross validation methods.

Edit

I made these changes and it looks like your problem then comes from the validation method check_array. While calling this method with dtype=pd.DataFrame would work, the linear model calls it with dtype=np.float64, which throws an error because the original categorical column holds strings that cannot be converted to floats. To get around this, instead of concatenating the original data with your dummies, you could just return your dummy columns and fit using those. This is something that should be done anyway, since you wouldn't want to include both the dummy columns and the original data in the model you are trying to fit. You may also consider the drop_first option, but I'm getting off subject. So, changing your transform method like so allows the whole process to work as expected.

def transform(self, X_DF):
    ''' Return the set of dummy-variables that replaces the orig column'''

    # convert self-attribute to local var for ease of stepping through function
    column = self.column

    # build columns for the new dummy vars; the original categorical column is not returned
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)

    return dummy_matrix
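Putting both changes together, the whole search can be sketched roughly as below. This is an illustration under some assumptions, not the exact code I ran: the class is renamed DummyVarEncoder, the dummies are cast to float so the classifier receives plain numbers, the folds are shuffled with a fixed seed, and the grid varies clf__C instead of clf__penalty, because which penalties LogisticRegression accepts depends on the solver and the scikit-learn version.

```python
import pandas as pd
from sklearn import datasets
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline


class DummyVarEncoder(TransformerMixin, BaseEstimator):
    """Return dummy variables for one categorical column of a DataFrame."""

    def __init__(self, column="category_1"):
        # Attribute name matches the parameter name, as sklearn expects.
        self.column = column

    def fit(self, X_DF, y=None):
        return self

    def transform(self, X_DF):
        # Only the dummy columns are returned, cast to float (an assumption
        # made here so the downstream estimator gets numeric input).
        dummies = pd.get_dummies(X_DF[self.column], prefix=self.column)
        return dummies.astype(float)


# Same toy data as in the question.
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="y")
for name, source in [("category_1", "sepal length (cm)"),
                     ("category_2", "sepal width (cm)")]:
    X[name] = pd.cut(X[source], bins=3, labels=["small", "medium", "large"])

pipeline = Pipeline([("dummy_vars", DummyVarEncoder()),
                     ("clf", LogisticRegression(max_iter=1000))])

# The grid key uses the renamed parameter: dummy_vars__column.
param_grid = {"dummy_vars__column": ["category_1", "category_2"],
              "clf__C": [0.1, 1.0]}

cv_model_search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    refit=True,
)
cv_model_search.fit(X, y)
print(cv_model_search.best_params_)
```

Since pd.cut produces a categorical column, get_dummies emits a column for every label even when a fold happens to contain no rows for it, so the feature set stays the same between training and scoring.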

answered Sep 19 '22 06:09

Grr

Tags: python, pandas, machine-learning, scikit-learn, pipeline
