I'm trying to write a tutorial on Pipeline for students, but I'm stuck. I'm not an expert, but I'm trying to improve, so thank you for your patience. I'm trying to run several steps in a pipeline to prepare a dataframe for a classifier:
Here is my code:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Step 1: print a short description of the dataframe
class Descr_df(object):

    def transform(self, X):
        print("Structure of the data: \n {}".format(X.head(5)))
        print("Features names: \n {}".format(X.columns))
        print("Target: \n {}".format(X.columns[0]))
        print("Shape of the data: \n {}".format(X.shape))

    def fit(self, X, y=None):
        return self
# Step 2: fill missing values (most frequent value or column mean)
class Fillna(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        for column in X.columns:
            if column in non_numerics_columns:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
            else:
                X[column] = X[column].fillna(X[column].mean())
        return X

    def fit(self, X, y=None):
        return self
# Step 3: encode non-numeric columns as integers
class Categorical_to_numerical(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            X[column] = X[column].fillna(X[column].value_counts().idxmax())
            le.fit(X[column])
            X[column] = le.transform(X[column]).astype(int)
        return X

    def fit(self, X, y=None):
        return self
If I execute steps 1 and 2, or steps 1 and 3, it works. But if I execute steps 1, 2 and 3 together, I get this error:
pipeline = Pipeline([('df_intropesction', Descr_df()), ('fillna',Fillna()), ('Categorical_to_numerical', Categorical_to_numerical())])
pipeline.fit(X, y)
AttributeError: 'NoneType' object has no attribute 'columns'
Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification.
Pipeline requires you to name the steps manually: the names are defined explicitly and follow no rule. make_pipeline names the steps automatically, using a straightforward rule (the lowercased class name of each estimator).
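To make the naming difference concrete, here is a minimal sketch. The choice of StandardScaler and LogisticRegression, and the step names "scale" and "clf", are only assumptions for illustration and are not taken from the question.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline: the caller picks every step name explicitly.
named = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# make_pipeline: names are derived from the lowercased class names.
auto = make_pipeline(StandardScaler(), LogisticRegression())

named_steps = [name for name, _ in named.steps]
auto_steps = [name for name, _ in auto.steps]
print(named_steps)
print(auto_steps)
```

Explicit names tend to be easier to reference later, while make_pipeline saves typing when the default names are good enough.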
Scikit-learn pipelines are a tool to simplify this process. They have several key benefits: They make your workflow much easier to read and understand. They enforce the implementation and order of steps in your project.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it lets you set the parameters of the individual steps by joining the step name and the parameter name with a '__'.
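A small sketch of the '__' convention, assuming a two-step pipeline whose step names ("scale" and "clf") and estimators are made up for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# "<step name>__<parameter name>" addresses one parameter of one step.
pipe.set_params(clf__C=0.1, scale__with_mean=False)

# Read the values back from the individual steps.
clf_C = pipe.named_steps["clf"].C
scale_with_mean = pipe.named_steps["scale"].with_mean
print(clf_C, scale_with_mean)
```

The same keys (for example "clf__C") can be used in the parameter grid of a search such as GridSearchCV, which is what makes the steps tunable together.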
This error arises because, in a Pipeline, the output of the first estimator is passed to the second, the output of the second is passed to the third, and so on.
From the documentation of Pipeline:
Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
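So for your pipeline, the steps of execution are as follows:
1. Descr_df.fit(X): does nothing and returns self.
2. newX = Descr_df.transform(X): should return a value to pass on to the next estimator, but your method only prints and has no return statement, so newX is None.
3. Fillna.fit(newX): does nothing and returns self.
4. Fillna.transform(newX): accesses newX.columns, but newX is None from step 2, hence the error.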
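The execution order can be traced by hand with a sketch like the one below. It is not the actual Pipeline implementation, only a rough imitation of how intermediate steps are chained. PrintOnly and NeedsColumns are hypothetical stand-ins for Descr_df and Fillna, and the small dataframe is invented.

```python
import pandas as pd


class PrintOnly:
    """Hypothetical stand-in for Descr_df: prints, but returns nothing."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: {}".format(X.shape))
        # No return statement, so Python implicitly returns None.


class NeedsColumns:
    """Hypothetical stand-in for Fillna: accesses X.columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[X.columns]


X = pd.DataFrame({"a": [1.0, None], "b": ["x", None]})

# Roughly what happens to each intermediate step: fit, then transform,
# and the transformed output becomes the input of the next step.
step1 = PrintOnly().fit(X)
new_X = step1.transform(X)          # new_X is None
step2 = NeedsColumns().fit(new_X)   # fit ignores its input, so no error yet

try:
    step2.transform(new_X)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The failure only shows up in the second step, which is why running the first step together with just one other step can hide the cause.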
Solution: Change the transform method of Descr_df to return the dataframe as it is:
def transform(self, X):
    print("Structure of the data: \n {}".format(X.head(5)))
    print("Features names: \n {}".format(X.columns))
    print("Target: \n {}".format(X.columns[0]))
    print("Shape of the data: \n {}".format(X.shape))
    return X
Suggestion: Make your classes inherit from the BaseEstimator and TransformerMixin classes in scikit-learn to conform to good practice, i.e. change class Descr_df(object) to class Descr_df(BaseEstimator, TransformerMixin), Fillna(object) to Fillna(BaseEstimator, TransformerMixin), and so on.
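As a rough sketch of that suggestion, the two transformers below inherit from BaseEstimator and TransformerMixin. The class names DescribeFrame and FillMissing, the step names, and the sample dataframe are hypothetical. The fill logic is a simplified variant of the question's Fillna that works on a copy of the input instead of modifying it in place.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DescribeFrame(BaseEstimator, TransformerMixin):
    """Prints a short description and passes the dataframe through."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: {}".format(X.shape))
        return X  # returning X is what keeps the pipeline going


class FillMissing(BaseEstimator, TransformerMixin):
    """Fills numeric columns with the mean, others with the most frequent value."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        for column in X.columns:
            if pd.api.types.is_numeric_dtype(X[column]):
                X[column] = X[column].fillna(X[column].mean())
            else:
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
        return X


X = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["Paris", "Paris", None]})

pipeline = Pipeline([("describe", DescribeFrame()), ("fillna", FillMissing())])
# fit_transform is provided by TransformerMixin for the last step.
result = pipeline.fit_transform(X)
print(result)
```

Inheriting from TransformerMixin gives each class a fit_transform method, and BaseEstimator provides get_params and set_params, so the custom steps behave like built-in ones inside a Pipeline.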
See this example for more details on custom classes in Pipeline: