How can I one hot encode a subset of columns?

Question

I have a data set which has some categorical columns. Here is a small sample:

Temp    precip dow  tod
-20.44  snow   4    14.5
-22.69  snow   4    15.216666666666667
-21.52  snow   4    17.316666666666666
-21.52  snow   4    17.733333333333334
-20.51  snow   4    18.15

Here, dow and precip are categorical, whereas the others are continuous.

Is there a way I can create a OneHotEncoder for just those columns? I don't want to use pd.get_dummies, because it won't put new data in the proper format unless every value of dow and precip also appears in the new data.
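To make the concern concrete, here is a small sketch of the mismatch. The two frames are made-up illustrations, not the real data: the training frame contains several categories, while the new frame contains only one, so pd.get_dummies produces different columns for each.

```python
import pandas as pd

# Hypothetical training data with several categories per column.
train = pd.DataFrame({"precip": ["snow", "rain", "snow"], "dow": [4, 5, 6]})
# Hypothetical new data that happens to contain a single category per column.
new = pd.DataFrame({"precip": ["snow"], "dow": [4]})

# get_dummies only creates columns for the values it sees in each frame.
train_cols = list(pd.get_dummies(train, columns=["precip", "dow"]).columns)
new_cols = list(pd.get_dummies(new, columns=["precip", "dow"]).columns)

print(train_cols)  # five dummy columns
print(new_cols)    # only two dummy columns
```

A model fitted on the five training columns cannot be applied directly to the two-column frame, which is why an encoder that remembers the training categories is needed.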

Marcus V. · Accepted Answer

Two things you could check out: sklearn-pandas and, as mentioned by @Grr, pipelines (with this good intro).

I prefer pipelines: they are tidy, work easily with things like grid search, avoid leakage between folds in cross-validation, and so on. I usually end up with a pipeline like this (assuming precip has been LabelEncoded first):

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression

class Columns(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns by name."""
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn: column selection is stateless.
        return self

    def transform(self, X):
        return X[self.names]

class Normalize(BaseEstimator, TransformerMixin):
    """Apply an optional function to the data; pass it through if func is None."""
    def __init__(self, func=None, func_param=None):
        self.func = func
        # None instead of {} avoids sharing one mutable default between instances.
        self.func_param = func_param

    def transform(self, X):
        if self.func is not None:
            return self.func(X, **(self.func_param or {}))
        return X

    def fit(self, X, y=None, **fit_params):
        return self


cat_cols = ['precip', 'dow']
num_cols = ['Temp', 'tod']

pipe = Pipeline([
    ("features", FeatureUnion([
        # Continuous columns: select, then optionally transform.
        ('numeric', make_pipeline(Columns(names=num_cols), Normalize())),
        # Categorical columns: select, then one-hot encode to a dense array.
        # sparse_output needs scikit-learn >= 1.2; older versions call it sparse.
        ('categorical', make_pipeline(Columns(names=cat_cols), OneHotEncoder(sparse_output=False)))
    ])),
    ('model', LinearRegression())
])
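As an alternative to the custom Columns selector, newer scikit-learn releases ship ColumnTransformer, which routes named columns to different transformers. The sketch below is not the pipeline above but a swapped-in variant of the same idea. It assumes a recent scikit-learn, where OneHotEncoder accepts string categories directly (so no LabelEncoder step), and it uses a made-up training frame with one extra "rain" row so that more than one category exists.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; the last row is invented for illustration.
train = pd.DataFrame({
    "Temp": [-20.44, -22.69, -21.52, 3.10],
    "precip": ["snow", "snow", "snow", "rain"],
    "dow": [4, 4, 4, 5],
    "tod": [14.5, 15.2, 17.3, 9.0],
})

cat_cols = ["precip", "dow"]
num_cols = ["Temp", "tod"]

encoder = ColumnTransformer(
    [
        # Continuous columns are copied through unchanged.
        ("numeric", "passthrough", num_cols),
        # Categories are learned at fit time; unseen values encode as all zeros.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
encoder.fit(train)

# New data containing a single category still gets the full set of columns.
new = train.iloc[[0]]
encoded_new = encoder.transform(new)
print(encoded_new.shape)
```

Because the categories are stored on the fitted encoder, the transformed new data has the same column layout as the training data, which is the property pd.get_dummies lacks. The encoder can be used as the first step of a Pipeline in place of the FeatureUnion.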

Tags: python, pandas, scikit-learn, feature-extraction

Demetri Pananos
