I have a simple model with a pipeline that uses ColumnTransformer.
I am able to train the model and save it as a pickle.
When I load the pickle and predict on real-time data, I receive the following error regarding ColumnTransformer:
Column ordering must be equal for fit and for transform when using the remainder keyword
The training data and the data used for prediction have exactly the same number of columns, e.g., 50. I am not sure how the "ordering" of the columns could have changed.
Why is the ordering of the columns important for ColumnTransformer? How can I fix this? Is there a way to ensure the "ordering" after running a column transformer?
Thanks.
pipeline = Pipeline([
    ('RepalceInf', ReplaceInf()),
    ('impute_30_100', ColumnTransformer(
        [
            ('oneStdNorm', OneStdImputer(), self.cont_feature_strategy_dict['FEATS_30_100']),
        ],
        remainder='passthrough'
    )),
    ('regress_impute', IterativeImputer(random_state=0, estimator=self.cont_estimator)),
    ('replace_outlier', OutlierReplacer(quantile_range=(1, 99))),
    ('scaler', StandardScaler(with_mean=True))
])
class OneStdImputer(TransformerMixin, BaseEstimator):
    def __init__(self):
        """
        Impute the missing data with random value in the range of mean +/- one standard deviation
        This is a simplified implementation without sparse/dense fit and check.
        """
        self.mean = None
        self.std = None

    def fit(self, X, y=None):
        self.mean = X.mean()
        self.std = X.std()
        return self

    def transform(self, X):
        # X_imp = X.fillna(np.random.randint()*2*self.std+self.mean-self.std)
        for col in X:
            self._fill_randnorm(X[col], col)
        return X

    def _fill_randnorm(self, df, col):
        val = df.values
        mask = np.isnan(df)
        mu, sigma = self.mean[col], self.std[col]
        val[mask] = np.random.normal(mu, sigma, size=mask.sum())
        return df
Applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.
Use scikit-learn's ColumnTransformer class to apply preprocessing steps such as MinMaxScaler and OneHotEncoder to numeric and categorical features simultaneously. ColumnTransformer bundles all the transformations into one object that can be used in scikit-learn pipelines.
ColumnTransformer is a scikit-learn class used to create and apply separate transformers for numerical and categorical data. To create it, pass a list of tuples, each containing a name, the transformer object, and the column(s) to which that transformer should be applied.
In scikit-learn, transformers are objects that transform a dataset into a new one to prepare it for predictive modeling, e.g., scaling numeric values, one-hot encoding categoricals, etc.
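As a rough illustration of the description above, here is a minimal sketch of a ColumnTransformer that selects a column by name and passes the rest through. The frame and its column names (`a`, `b`, `c`) are made up for this example and are not from the question. It also shows that the output places the transformed columns first, followed by the remainder columns in their original order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical data; the column names are assumptions for illustration only.
df = pd.DataFrame({
    "a": [10.0, 20.0, 30.0],
    "b": [1.0, 2.0, 3.0],
    "c": [0.0, 1.0, 0.0],
})

# Scale only column "b"; every other column is passed through unchanged.
ct = ColumnTransformer(
    [("scale_b", StandardScaler(), ["b"])],
    remainder="passthrough",
)

out = ct.fit_transform(df)

# The result is a NumPy array: scaled "b" comes first,
# then the remainder columns "a" and "c" in their original order.
print(out)
```

Because the output order differs from the input order, any later step that relies on column positions has to account for this rearrangement.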
You can use df_new = pd.DataFrame(df_origin, columns=df_train.columns)
to make sure the data to predict has the same columns, in the same order, as the training data.
From the given example, it appears that ColumnTransformer
uses the positional index of a chosen column to decide what to process. (Although you can select a column by its exact name, I think the name is converted to an index too.)
>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
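The column alignment suggested in this answer can be sketched as follows. The frames and column names here are hypothetical stand-ins for the training data and the real-time data. The idea is to keep the training column order (for example, by saving it alongside the pickled model) and reorder incoming data to match before calling predict.

```python
import pandas as pd

# Hypothetical training frame; the column names are made up for illustration.
df_train = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})

# Incoming data with the same columns in a different order.
df_origin = pd.DataFrame({"c": [7.0], "a": [8.0], "b": [9.0]})

# Keep the training column order, e.g. store it next to the pickled model.
train_columns = list(df_train.columns)

# reindex returns the columns in the training order; a column missing from
# df_origin would come back filled with NaN instead of raising an error.
df_new = df_origin.reindex(columns=train_columns)

# Stricter alternative: label-based selection raises KeyError
# if any training column is missing from the incoming data.
df_strict = df_origin[train_columns]

print(list(df_new.columns))
```

Which variant fits better depends on whether a missing column should fail loudly or be left for a downstream imputer to fill.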