Using multiple custom classes with Pipeline sklearn (Python)

I'm trying to write a tutorial on Pipeline for students, but I'm stuck. I'm not an expert, but I'm trying to improve, so thank you for your patience. I want a pipeline that runs several steps to prepare a dataframe for a classifier:

Step 1: Description of the dataframe
Step 2: Fill NaN Values
Step 3: Transforming Categorical Values into Numbers

Here is my code:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

class Descr_df(object):

    def transform (self, X):
        print ("Structure of the data: \n {}".format(X.head(5)))
        print ("Features names: \n {}".format(X.columns))
        print ("Target: \n {}".format(X.columns[0]))
        print ("Shape of the data: \n {}".format(X.shape))

    def fit(self, X, y=None):
        return self

class Fillna(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        for column in X.columns:
            if column in non_numerics_columns:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
            else:
                X[column] = X[column].fillna(X[column].mean())
        return X

    def fit(self, X,y=None):
        return self

class Categorical_to_numerical(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            X[column] = X[column].fillna(X[column].value_counts().idxmax())
            le.fit(X[column])
            X[column] = le.transform(X[column]).astype(int)
        return X

    def fit(self, X, y=None):
        return self

If I run steps 1 and 2, or steps 1 and 3, it works, but if I run steps 1, 2 and 3 together, I get this error:

pipeline = Pipeline([('df_intropesction', Descr_df()), ('fillna',Fillna()), ('Categorical_to_numerical', Categorical_to_numerical())])
pipeline.fit(X, y)
AttributeError: 'NoneType' object has no attribute 'columns'

asked Apr 19 '17 14:04

Jeremie Guez

1 Answer

This error arises because, in a Pipeline, the output of the first estimator's transform is passed to the second, the output of the second is passed to the third, and so on.

From the documentation of Pipeline:

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

So for your pipeline, the steps of execution are as follows:

1. Descr_df.fit(X) -> does nothing and returns self.
2. newX = Descr_df.transform(X) -> should return a value to assign to newX, which is then passed on to the next estimator. Your definition only prints and returns nothing, so None is returned implicitly.
3. Fillna.fit(newX) -> does nothing and returns self.
4. Fillna.transform(newX) -> calls newX.columns, but newX is None from step 2. Hence the error.
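The chain described above can be reproduced without scikit-learn. The following is only a simplified sketch of the chaining idea, not the actual Pipeline implementation: run_steps, PrintOnly and NeedsColumns are hypothetical names invented here, and the dataframe is made up.

```python
import pandas as pd


class PrintOnly(object):
    """Hypothetical stand-in for Descr_df: transform() prints and returns nothing."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: {}".format(X.shape))
        # No return statement, so Python returns None implicitly.


class NeedsColumns(object):
    """Hypothetical stand-in for Fillna: transform() reads X.columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[list(X.columns)]


def run_steps(steps, X):
    """Simplified chaining: each step's transform output feeds the next step."""
    for step in steps:
        X = step.fit(X).transform(X)
    return X


# Made-up data, for illustration only.
df = pd.DataFrame({"a": [1.0, None], "b": ["x", None]})

try:
    run_steps([PrintOnly(), NeedsColumns()], df)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The second step receives None instead of a dataframe, so reading .columns on it raises the same kind of AttributeError as in the question.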

Solution: Change the transform method of Descr_df so that it returns the dataframe unchanged:

def transform (self, X):
    print ("Structure of the data: \n {}".format(X.head(5)))
    print ("Features names: \n {}".format(X.columns))
    print ("Target: \n {}".format(X.columns[0]))
    print ("Shape of the data: \n {}".format(X.shape))
    return X
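A quick way to check this fix in isolation is to call the step by hand on a small dataframe and confirm that the same object comes back. This is a minimal sketch: the data is made up, and the print calls are shortened to one line.

```python
import pandas as pd


class Descr_df(object):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: \n {}".format(X.shape))
        return X  # hand the unchanged dataframe to the next step


# Made-up data, for illustration only.
df = pd.DataFrame({"age": [20.0, None], "city": ["Paris", None]})

# Call the step by hand, the way a pipeline would for an intermediate step.
out = Descr_df().fit(df).transform(df)
print(out is df)
```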

Suggestion: Make your classes inherit from the BaseEstimator and TransformerMixin classes in scikit-learn (both in sklearn.base) to conform to good practice.

That is, change class Descr_df(object) to class Descr_df(BaseEstimator, TransformerMixin), class Fillna(object) to class Fillna(BaseEstimator, TransformerMixin), and so on.
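Putting the fix and the suggestion together, the whole pipeline might look like the sketch below. It is not the asker's exact code: the toy dataframe and the step names are made up, select_dtypes replaces the private _get_numeric_data call, each transform works on a copy so the input is not modified, and the print calls are shortened.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


class Descr_df(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: \n {}".format(X.shape))
        return X  # pass the dataframe on to the next step


class Fillna(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # avoid modifying the caller's dataframe
        numeric_columns = X.select_dtypes(include="number").columns
        for column in X.columns:
            if column in numeric_columns:
                X[column] = X[column].fillna(X[column].mean())
            else:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
        return X


class Categorical_to_numerical(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        non_numeric_columns = X.select_dtypes(exclude="number").columns
        for column in non_numeric_columns:
            X[column] = LabelEncoder().fit_transform(X[column]).astype(int)
        return X


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "city": ["Paris", None, "Paris"],
})

pipeline = Pipeline([
    ("describe", Descr_df()),
    ("fillna", Fillna()),
    ("categorical_to_numerical", Categorical_to_numerical()),
])

result = pipeline.fit_transform(df)
print(result)
```

Inheriting from TransformerMixin supplies fit_transform, and BaseEstimator supplies get_params and set_params, so the classes behave like other scikit-learn transformers.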

See this example for more details on custom classes in Pipeline:

http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py

answered Oct 04 '22 20:10

Vivek Kumar

Tags: python, pandas, machine-learning, scikit-learn, pipeline
