I'm currently exploring scikit-learn pipelines, and I want to preprocess the data with a pipeline as well. However, my train and test data have different levels of the categorical variable. For example, consider:
import pandas as pd
train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))
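I wrote a TransformerMixin class using pandas:
from sklearn.base import TransformerMixin
class CreateDummies(TransformerMixin):
    def transform(self, X, **transformparams):
        # One dummy column per level present in X
        return pd.get_dummies(X).copy()
    def fit(self, X, y=None, **fitparams):
        # Nothing is learned from the training data
        return self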
fit_transform yields 2 columns for the train data and 4 columns for the test data. No surprise here, but this is not suitable for a pipeline.
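One way to keep the column count stable is to remember the training columns in fit and align to them in transform. The following is only a minimal sketch of that idea, not part of the original post: the class name FittedDummies and the attribute columns_ are hypothetical, and levels that appear only in the test data are simply dropped.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedDummies(BaseEstimator, TransformerMixin):
    """Hypothetical variant of CreateDummies that remembers the training columns."""

    def fit(self, X, y=None):
        # Record the dummy columns produced by the training data.
        self.columns_ = pd.get_dummies(X).columns
        return self

    def transform(self, X):
        # Align to the training columns: levels unseen during fit are dropped,
        # and training levels missing from X become all-zero columns.
        dummies = pd.get_dummies(X).reindex(columns=self.columns_, fill_value=0)
        return dummies.astype(int)


train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))

enc = FittedDummies().fit(train)
train_out = enc.transform(train)
test_out = enc.transform(test)
```

With this sketch both outputs have the same two columns, and the test rows for 'c' and 'd' are all zeros, so no information from the test set is needed at fit time.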
Similarly, I tried the LabelEncoder (importing OneHotEncoder as well for the potential next steps):
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
le.fit_transform(train)
le.transform(test)
which yields, not surprisingly, an error.
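For reference, the failure can be reproduced in isolation. This is a small sketch using the same train and test series; the variable names train_codes and error_message are only there for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(list('abbaa'))
test = pd.Series(list('abcd'))

le = LabelEncoder()
train_codes = le.fit_transform(train)  # the encoder only learns 'a' and 'b'

try:
    le.transform(test)  # 'c' and 'd' were never seen during fit
    error_message = None
except ValueError as exc:
    # LabelEncoder refuses labels it did not see while fitting.
    error_message = str(exc)
```

The encoder has no option to skip unknown labels, which is why the levels have to be known at fit time or handled by a different encoder.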
So the problem here is that I need some information contained in the test set. Is there a good way to include this in a pipeline?
You can use categoricals as explained in this answer:
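import numpy as np
categories = np.union1d(train, test)
# Recent pandas versions no longer accept astype('category', categories=...); pass a CategoricalDtype instead
train = train.astype(pd.CategoricalDtype(categories=categories))
test = test.astype(pd.CategoricalDtype(categories=categories))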
pd.get_dummies(train)
Out:
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  1  0  0
3  1  0  0  0
4  1  0  0  0
pd.get_dummies(test)
Out:
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
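If peeking at the test set is not an option, a possible alternative inside an actual pipeline is scikit-learn's OneHotEncoder with handle_unknown='ignore'. This is a sketch of a different technique from the categoricals approach above, assuming a scikit-learn version whose OneHotEncoder accepts string columns directly; the column name 'letter' and the step name 'onehot' are made up for the example.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input, so the series are wrapped in DataFrames.
train = pd.DataFrame({'letter': list('abbaa')})
test = pd.DataFrame({'letter': list('abcd')})

pipe = Pipeline([
    # Levels unseen during fit are encoded as all-zero rows instead of raising.
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# The encoder returns a sparse matrix by default; densify it for inspection.
train_encoded = pipe.fit_transform(train).toarray()
test_encoded = pipe.transform(test).toarray()
```

Here the encoder only learns 'a' and 'b', so both outputs have two columns and the test rows for 'c' and 'd' come out as zeros. If the full set of levels is known up front, it can be passed through the encoder's categories parameter instead, which should give four columns as in the answer above.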