I have a data set which has some categorical columns. Here is a small sample:
Temp precip dow tod
-20.44 snow 4 14.5
-22.69 snow 4 15.216666666666667
-21.52 snow 4 17.316666666666666
-21.52 snow 4 17.733333333333334
-20.51 snow 4 18.15
Here, the dow
and precip
are categorical, where as the others are continuous.
Is there a way I can create a OneHotEncoder
for just those columns? I don't want to use pd.get_dummies
because that won't put the data in the proper format unless of each dow
and precip
are in the new data.
Two things you could check out: sklearn-pandas and as mentioned by @Grr pipelines with this good intro.
So I prefer pipelines, as they are a tidy way, allow easy use with things like grid-seach, avoid leakage between folds in cross validation, etc. So I usually end up having a pipe like that (given you have precip LabelEncoded first):
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
class Columns(BaseEstimator, TransformerMixin):
def __init__(self, names=None):
self.names = names
def fit(self, X, y=None, **fit_params):
return self
def transform(self, X):
return X[self.names]
class Normalize(BaseEstimator, TransformerMixin):
def __init__(self, func=None, func_param={}):
self.func = func
self.func_param = func_param
def transform(self, X):
if self.func != None:
return self.func(X, **self.func_param)
else:
return X
def fit(self, X, y=None, **fit_params):
return self
cat_cols = ['precip', 'dow']
num_cols = ['Temp','tod']
pipe = Pipeline([
("features", FeatureUnion([
('numeric', make_pipeline(Columns(names=num_cols),Normalize())),
('categorical', make_pipeline(Columns(names=cat_cols),OneHotEncoder(sparse=False)))
])),
('model', LinearRegression())
])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With