
Appending the ColumnTransformer() result to the original data within a pipeline?

This is my input data:

[image: input DataFrame]

This is the desired output, with transformations applied to the columns r, f, and m and the results appended to the original data:

[image: desired output DataFrame]

Here's the code:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer    

# Box-Cox requires strictly positive inputs, so sample from 1..99 rather than 0..99
df = pd.DataFrame(np.random.randint(1, 100, size=(10, 3)), columns=list('rfm'))
column_trans = ColumnTransformer(
    [('r_std', StandardScaler(), ['r']),
     ('f_std', StandardScaler(), ['f']),
     ('m_std', StandardScaler(), ['m']),
     ('r_boxcox', PowerTransformer(method='box-cox'), ['r']),
     ('f_boxcox', PowerTransformer(method='box-cox'), ['f']),
     ('m_boxcox', PowerTransformer(method='box-cox'), ['m']),
    ])

transformed = column_trans.fit_transform(df)
new_cols = ['r_std', 'f_std', 'm_std', 'r_boxcox', 'f_boxcox', 'm_boxcox']

transformed_df = pd.DataFrame(transformed, columns=new_cols)
pd.concat([df, transformed_df], axis = 1)

I'll need additional transformers as well, so I need to keep the original columns within a pipeline. Is there a better way to handle this, in particular to do the concatenation and column naming inside the pipeline?

zinyosrim asked Feb 08 '19 12:02



2 Answers

One way to do it is to use a dummy transformer that simply returns its input columns unchanged:

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer

np.random.seed(1714)

class NoTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X
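If you would rather not define a class, scikit-learn's FunctionTransformer acts as an identity transformer when no function is passed. The following is a minimal sketch, assuming a reasonably recent scikit-learn (where validate defaults to False); the toy DataFrame and the names toy, identity and out are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({"r": [1, 2, 3], "f": [4, 5, 6]})

# With func=None (the default), FunctionTransformer passes its input through unchanged.
identity = FunctionTransformer()
out = identity.fit_transform(toy)

print(out.equals(toy))
```

It can be dropped into a ColumnTransformer in the same position as NoTransformer().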

I'm adding an id column to the dataset so I can show the use of the remainder parameter in ColumnTransformer(), which I find very useful.

df = pd.DataFrame(np.hstack((np.arange(10).reshape((10, 1)),
                             np.random.randint(1,100,size=(10, 3)))),
                  columns=["id"] + list('rfm'))

Setting remainder="passthrough" (the default is "drop") retains the columns that are not listed in any transformer; see the ColumnTransformer docs.

With the NoTransformer() dummy class, the columns 'r', 'f', 'm' are also carried into the output with their original values.

column_trans = ColumnTransformer(
    [('r_original', NoTransformer(), ['r']),
     ('f_original', NoTransformer(), ['f']),
     ('m_original', NoTransformer(), ['m']),
     ('r_std', StandardScaler(), ['r']),
     ('f_std', StandardScaler(), ['f']),
     ('m_std', StandardScaler(), ['m']),
     ('r_boxcox', PowerTransformer(method='box-cox'), ['r']),
     ('f_boxcox', PowerTransformer(method='box-cox'), ['f']),
     ('m_boxcox', PowerTransformer(method='box-cox'), ['m']),
    ], remainder="passthrough")

A tip if you want to transform many more columns: once fitted, the ColumnTransformer() instance (column_trans in your case) has a transformers_ attribute that lets you access the transformer names ('r_std', 'f_std', 'm_std', 'r_boxcox', 'f_boxcox', 'm_boxcox', ...) programmatically:

column_trans.fit(df)
column_trans.transformers_

#[('r_original', NoTransformer(), ['r']),
# ('f_original', NoTransformer(), ['f']),
# ('m_original', NoTransformer(), ['m']),
# ('r_std', StandardScaler(copy=True, with_mean=True, with_std=True), ['r']),
# ('f_std', StandardScaler(copy=True, with_mean=True, with_std=True), ['f']),
# ('m_std', StandardScaler(copy=True, with_mean=True, with_std=True), ['m']),
# ('r_boxcox',
#  PowerTransformer(copy=True, method='box-cox', standardize=True),
#  ['r']),
# ('f_boxcox',
#  PowerTransformer(copy=True, method='box-cox', standardize=True),
#  ['f']),
# ('m_boxcox',
#  PowerTransformer(copy=True, method='box-cox', standardize=True),
#  ['m']),
# ('remainder', 'passthrough', [0])]
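To pull just the names out of transformers_, you can unpack each (name, transformer, columns) tuple. A small self-contained sketch; the demo DataFrame and the variable names ct and step_names are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with an extra 'id' column that no transformer claims.
demo = pd.DataFrame({"id": [0, 1, 2], "r": [10.0, 20.0, 30.0], "f": [1.0, 5.0, 9.0]})

ct = ColumnTransformer(
    [("r_std", StandardScaler(), ["r"]),
     ("f_std", StandardScaler(), ["f"])],
    remainder="passthrough",
)
ct.fit(demo)

# Each fitted entry is a (name, transformer, columns) tuple.
step_names = [name for name, _, _ in ct.transformers_]
print(step_names)
```

Because 'id' is left over, the fitted list ends with an extra 'remainder' entry, which you may want to filter out when building column names.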


Finally, I think your code could be simplified like this:

column_trans_2 = ColumnTransformer(
    [('original', NoTransformer(), ['r', 'f', 'm']),
     ('std', StandardScaler(), ['r', 'f', 'm']),
     ('boxcox', PowerTransformer(method='box-cox'), ['r', 'f', 'm']),
    ], remainder="passthrough")

transformed_2 = column_trans_2.fit_transform(df)
column_trans_2.transformers_

#[('original', NoTransformer(), ['r', 'f', 'm']),
# ('std',
#  StandardScaler(copy=True, with_mean=True, with_std=True),
#  ['r', 'f', 'm']),
# ('boxcox',
#  PowerTransformer(copy=True, method='box-cox', standardize=True),
#  ['r', 'f', 'm']),
# ('remainder', 'passthrough', [0])]

And build the column names programmatically from the transformers list (the one passed to the constructor, which has no 'remainder' entry):

new_col_names = [
    name + '_' + col
    for name, _, cols in column_trans_2.transformers
    for col in cols
]
# Columns kept by remainder="passthrough" ('id' in this case) are appended on the
# right of the output array and are not listed in `transformers`, so add them manually.
new_col_names += ['id']

# ['original_r', 'original_f', 'original_m', 'std_r', 'std_f', 'std_m', 'boxcox_r',
#  'boxcox_f', 'boxcox_m', 'id']


pd.DataFrame(transformed_2, columns=new_col_names)

J.A. answered Oct 30 '22 23:10


Yes, there is a way to do this, and luckily it is built into scikit-learn. The official documentation of ColumnTransformer contains a terse but useful passage:

transformer{‘drop’, ‘passthrough’} or estimator

Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

This means that if you want to keep or drop a column inside a ColumnTransformer, you can say so with one of the two special-cased strings in place of an estimator, like this:

column_trans = ColumnTransformer(
    [('r_std', StandardScaler(), ['r']),
     ('f_std', StandardScaler(), ['f']),
     ('m_std', StandardScaler(), ['m']),
     ('r_boxcox', PowerTransformer(method='box-cox'), ['r']),
     ('f_boxcox', PowerTransformer(method='box-cox'), ['f']),
     ('m_boxcox', PowerTransformer(method='box-cox'), ['m']),
     ('col_keep', 'passthrough', ['r', 'f', 'm'])
    ])

When you then apply the ColumnTransformer, those three columns are kept rather than dropped. Conversely, using 'drop' instead of 'passthrough' lets you drop specific columns; combined with remainder='passthrough', this drops some columns and keeps all the others. I hope you find this useful!
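For completeness, here is a runnable sketch of this approach that also rebuilds the DataFrame. Output columns follow the order of the transformer list, so the 'passthrough' entry is placed first here to get the original columns on the left. The entry name 'keep', the naming scheme and the variables names and result are illustrative choices, not part of the scikit-learn API:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.RandomState(0)
# Box-Cox needs strictly positive values, hence randint(1, 100).
df = pd.DataFrame(rng.randint(1, 100, size=(10, 3)), columns=list("rfm"))

cols = ["r", "f", "m"]
column_trans = ColumnTransformer([
    ("keep", "passthrough", cols),   # original columns, untouched
    ("std", StandardScaler(), cols),
    ("boxcox", PowerTransformer(method="box-cox"), cols),
])
transformed = column_trans.fit_transform(df)

# Build names like 'r', 'r_std', 'r_boxcox' from the transformer list.
names = [
    col if name == "keep" else col + "_" + name
    for name, _, selected in column_trans.transformers
    for col in selected
]
result = pd.DataFrame(transformed, columns=names, index=df.index)
print(result.head())
```

Since the ColumnTransformer is an ordinary estimator, it can be used as a step in a Pipeline; only the final wrapping into a DataFrame happens outside.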

S. Czop answered Oct 31 '22 00:10