scikit-learn ColumnTransformer error: Column ordering must be equal for fit and for transform when using the remainder keyword

I have a simple model with a pipeline that uses ColumnTransformer.

I am able to train the model and save it as a pickle.

When I load the pickle and predict on real-time data, I get the following error from ColumnTransformer:

Column ordering must be equal for fit and for transform when using the remainder keyword

The training data and the data used for prediction have exactly the same number of columns, e.g., 50. I am not sure how the "ordering" of the columns could have changed.

Why is column ordering important for ColumnTransformer? How can I fix this? Is there a way to enforce the column ordering before the data reaches the ColumnTransformer?

Thanks.

    pipeline = Pipeline([
        ('RepalceInf', ReplaceInf()),
        ('impute_30_100', ColumnTransformer(
            [
                ('oneStdNorm', OneStdImputer(), self.cont_feature_strategy_dict['FEATS_30_100']),
            ],
            remainder='passthrough'
        )),
        ('regress_impute', IterativeImputer(random_state=0, estimator=self.cont_estimator)),
        ('replace_outlier', OutlierReplacer(quantile_range=(1, 99))),
        ('scaler', StandardScaler(with_mean=True))
    ])



    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin


    class OneStdImputer(TransformerMixin, BaseEstimator):
        """
        Impute missing values with random draws from a normal distribution
        that has the column's mean and standard deviation.
        This is a simplified implementation without sparse/dense fit and check.
        """

        def __init__(self):
            self.mean = None
            self.std = None

        def fit(self, X, y=None):
            # Per-column statistics of the training DataFrame.
            self.mean = X.mean()
            self.std = X.std()
            return self

        def transform(self, X):
            # X_imp = X.fillna(np.random.randint()*2*self.std+self.mean-self.std)
            # Work on a copy so the caller's DataFrame is not modified.
            X = X.copy()
            for col in X:
                X[col] = self._fill_randnorm(X[col], col)
            return X

        def _fill_randnorm(self, series, col):
            # Replace the NaNs of one column with draws from N(mean, std).
            series = series.copy()
            mask = series.isna()
            mu, sigma = self.mean[col], self.std[col]
            series[mask] = np.random.normal(mu, sigma, size=mask.sum())
            return series
asked Oct 11 '19 12:10 by tudou



1 Answer

You can use df_new = pd.DataFrame(df_origin, columns=df_train.columns) to make sure the data you predict on has the same columns, in the same order, as the training data.
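As a minimal sketch of that reordering, assuming both inputs are pandas DataFrames: the column names `a`, `b`, `c` and the values are invented for illustration, and `reindex` is shown as an equivalent alternative rather than something the original suggestion used.

```python
import pandas as pd

# Hypothetical training frame; only its column order matters here.
df_train = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# Hypothetical prediction data: same columns, different order.
df_origin = pd.DataFrame({"c": [50.0], "a": [10.0], "b": [30.0]})

# The column counts match, but the order does not.
same_order = list(df_origin.columns) == list(df_train.columns)

# Reorder the prediction data to the training column order.
df_new = pd.DataFrame(df_origin, columns=df_train.columns)

# Equivalent alternative using reindex.
df_new_alt = df_origin.reindex(columns=df_train.columns)

print(same_order)
print(list(df_new.columns))
```

Note that both forms fill a column with NaN if it is missing from the prediction data, so it may be worth checking for missing columns first.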

From the example below, it is clear that ColumnTransformer identifies the chosen columns by their position. (You can also select a column by its exact name, but I think the name is converted to a position internally as well.)

>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
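One way to guard against reordered columns after loading a pickle is to store the training column order next to the model. This is only a sketch under stated assumptions: `coef` is a hypothetical stand-in for the fitted pipeline (a weight vector applied by position), and the column names and values are invented for illustration.

```python
import pickle

import numpy as np
import pandas as pd

# Hypothetical training data and a stand-in "model": one weight per
# column, applied by position, as happens once data becomes an array.
df_train = pd.DataFrame({"a": [1.0, 3.0], "b": [10.0, 30.0]})
coef = np.array([1.0, 10.0])  # weight for "a", then weight for "b"

# Save the model together with the training column order.
payload = pickle.dumps({"model": coef, "columns": list(df_train.columns)})

# Later, at prediction time: load the bundle and receive live data
# whose columns arrive in a different order.
bundle = pickle.loads(payload)
df_live = pd.DataFrame({"b": [20.0], "a": [2.0]})

# Without reordering, the weights are applied to the wrong columns.
wrong = df_live.to_numpy() @ bundle["model"]

# Fail loudly on missing columns, then restore the training order.
missing = set(bundle["columns"]) - set(df_live.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")
right = df_live[bundle["columns"]].to_numpy() @ bundle["model"]

print(wrong, right)
```

Selecting with `df_live[bundle["columns"]]` raises a KeyError on a missing column by itself; the explicit check just gives a clearer message.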
answered Oct 18 '22 01:10 by 小笼包