The main goals are as follows:
- Apply StandardScaler to continuous variables
- Apply LabelEncoder and OneHotEncoder to categorical variables
The continuous variables need to be scaled, but a couple of the categorical variables are also of integer type. Applying StandardScaler to the whole DataFrame would therefore scale those integer-coded categorical variables as well, which is not what we want. Since continuous and categorical variables are mixed in a single Pandas DataFrame, what's the recommended workflow to approach this kind of problem?
The best example to illustrate my point is the Kaggle Bike Sharing Demand dataset, where season and weather are integer categorical variables.
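To make the setup concrete, here is a minimal sketch (assuming a small subset of the Bike Sharing Demand data with its usual column names such as season, weather, temp and humidity) of why scaling the whole frame is undesirable:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical subset: temp/humidity are continuous,
# season/weather are integer-coded categories.
df = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "weather": [1, 1, 2, 3],
    "temp": [9.84, 13.6, 22.1, 30.3],
    "humidity": [81, 80, 77, 62],
})

# Scaling everything also transforms the categorical codes -- the problem described above.
scaled_all = StandardScaler().fit_transform(df)

# What we actually want: scale only the continuous columns.
scaled_cont = StandardScaler().fit_transform(df[["temp", "humidity"]])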
One-hot encoding: to improve model performance, one can first label-encode the categorical variables and then expand those integer codes into binary indicator columns, which is the most machine-readable form. Pandas' get_dummies() converts categorical variables into dummy/indicator variables.
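A minimal sketch of get_dummies on the (assumed) season and weather columns from above:

import pandas as pd

df = pd.DataFrame({"season": [1, 2, 3, 4], "weather": [1, 1, 2, 3]})

# Passing columns= forces the integer-coded columns to be treated as categories
# and expanded into 0/1 indicator columns.
dummies = pd.get_dummies(df, columns=["season", "weather"])
print(dummies.columns.tolist())
# ['season_1', 'season_2', 'season_3', 'season_4', 'weather_1', 'weather_2', 'weather_3']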
Check out the sklearn_pandas.DataFrameMapper
meta-transformer. Use it as the first step in your pipeline to perform column-wise data engineering operations:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn_pandas import DataFrameMapper

# Wrap each continuous column in a list so StandardScaler receives a 2-D array;
# LabelBinarizer expects a single 1-D column, so the bare column name is fine.
mapper = DataFrameMapper(
    [([continuous_col], StandardScaler()) for continuous_col in continuous_cols] +
    [(categorical_col, LabelBinarizer()) for categorical_col in categorical_cols]
)

pipeline = Pipeline([
    ("mapper", mapper),
    ("estimator", estimator),
])

# The final step is a predictor, so fit the pipeline rather than fit_transform it.
pipeline.fit(df, df["y"])
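Each tuple in the mapper pairs a column (or list of columns) with its own transformer, so scaling is applied only to the continuous columns and binarization only to the categorical ones; the mapper then concatenates the transformed columns back into a single feature matrix for the estimator.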
Also, you should be using sklearn.preprocessing.LabelBinarizer instead of a list of [LabelEncoder(), OneHotEncoder()].
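As a minimal sketch (on made-up integer weather codes), LabelBinarizer produces the one-hot columns in a single step:

from sklearn.preprocessing import LabelBinarizer

weather = [1, 1, 2, 3, 2]  # hypothetical integer weather codes

# LabelBinarizer goes straight from labels to an indicator matrix,
# replacing the LabelEncoder -> OneHotEncoder two-step.
onehot = LabelBinarizer().fit_transform(weather)
print(onehot)
# [[1 0 0]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]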