Appending the ColumnTransformer() result to the original data within a pipeline?

This is my input data:

[image: input data]

This is the desired output: the transformations are applied to the columns r, f, and m, and the result is appended to the original data:

[image: desired output]

Here's the code:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer    

# Box-Cox requires strictly positive values, so sample from 1..99 rather than 0..99
df = pd.DataFrame(np.random.randint(1, 100, size=(10, 3)), columns=list('rfm'))
column_trans = ColumnTransformer(
    [('r_std', StandardScaler(), ['r']),
     ('f_std', StandardScaler(), ['f']),
     ('m_std', StandardScaler(), ['m']),
     ('r_boxcox', PowerTransformer(method='box-cox'), ['r']),
     ('f_boxcox', PowerTransformer(method='box-cox'), ['f']),
     ('m_boxcox', PowerTransformer(method='box-cox'), ['m']),
    ])

transformed = column_trans.fit_transform(df)
new_cols = ['r_std', 'f_std', 'm_std', 'r_boxcox', 'f_boxcox', 'm_boxcox']

transformed_df = pd.DataFrame(transformed, columns=new_cols)
pd.concat([df, transformed_df], axis=1)

I'll need additional transformers as well, so I need to keep the originating columns within a pipeline. Is there a better way to handle this? In particular, can the concatenation and the column naming be done within a pipeline?

asked Feb 08 '19 12:02

zinyosrim

2 Answers

One way to do it would be to use a dummy transformer that simply returns its input columns unchanged:

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler

np.random.seed(1714)

class NoTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X
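
As a side note, scikit-learn's FunctionTransformer can play the same role without a custom class, because its default func=None means the identity function. This is only a minimal sketch; the demo frame and the variable names are made up for illustration and are not part of the question's data.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical demo frame, not the data from the question.
demo = pd.DataFrame({"r": [1, 2, 3], "f": [4, 5, 6]})

# With func=None (the default), FunctionTransformer applies the identity function,
# so the input comes back unchanged.
identity = FunctionTransformer()
unchanged = identity.fit_transform(demo)
```

Unlike NoTransformer, this version does not assert that the input is a DataFrame, so keep the custom class if you want that check.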

I'm adding an id column to the dataset so I can show the use of the remainder parameter in ColumnTransformer(), which I find very useful.

df = pd.DataFrame(np.hstack((np.arange(10).reshape((10, 1)),
                             np.random.randint(1,100,size=(10, 3)))),
                  columns=["id"] + list('rfm'))

Setting remainder to "passthrough" (the default is "drop") retains the columns that are not listed in any transformer, as described in the docs.

With the NoTransformer() dummy class we can also pass the columns 'r', 'f', and 'm' through with their original values.

column_trans = ColumnTransformer(
    [('r_original', NoTransformer(), ['r']),
     ('f_original', NoTransformer(), ['f']),
     ('m_original', NoTransformer(), ['m']),
     ('r_std', StandardScaler(), ['r']),
     ('f_std', StandardScaler(), ['f']),
     ('m_std', StandardScaler(), ['m']),
     ('r_boxcox', PowerTransformer(method='box-cox'), ['r']),
     ('f_boxcox', PowerTransformer(method='box-cox'), ['f']),
     ('m_boxcox', PowerTransformer(method='box-cox'), ['m']),
    ], remainder="passthrough")
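
To see where the remainder columns end up, here is a small sketch on a made-up two-column frame (the data and names are assumptions for illustration only). The passthrough columns are appended to the right of the transformed ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical demo frame: 'id' is not listed in any transformer.
demo = pd.DataFrame({"id": [10, 11, 12], "r": [1.0, 2.0, 3.0]})

ct = ColumnTransformer(
    [("r_std", StandardScaler(), ["r"])],
    remainder="passthrough",
)
out = ct.fit_transform(demo)

# Column 0 holds the scaled 'r'; column 1 holds the untouched 'id',
# even though 'id' came first in the input frame.
scaled_r = out[:, 0]
passed_id = out[:, 1]
```

So the output column order follows the transformer list first and the remainder last, not the column order of the input.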

A tip if you want to transform many more columns: the fitted ColumnTransformer() (column_trans in your case) has a transformers_ attribute that lets you access the transformer names ('r_original', ..., 'r_std', ..., 'm_boxcox') programmatically:

column_trans.fit(df)  # transformers_ only exists after fitting
column_trans.transformers_

#[('r_original', NoTransformer(), ['r']),
# ('f_original', NoTransformer(), ['f']),
# ('m_original', NoTransformer(), ['m']),
# ('r_std', StandardScaler(copy=True, with_mean=True, with_std=True), ['r']),
# ('f_std', StandardScaler(copy=True, with_mean=True, with_std=True), ['f']),
# ('m_std', StandardScaler(copy=True, with_mean=True, with_std=True), ['m']),
# ('r_boxcox',
#  PowerTransformer(copy=True, method='box-cox', standardize=True),
#  ['r']),
# ('f_boxcox',
#  PowerTransformer(copy=True, method='box-cox', standardize=True),
#  ['f']),
# ('m_boxcox',
#  PowerTransformer(copy=True, method='box-cox', standardize=True),
#  ['m']),
# ('remainder', 'passthrough', [0])]
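
The difference between the transformers list you pass in and the fitted transformers_ attribute is easy to miss, so here is a short sketch on made-up data (names are assumptions). Only the fitted attribute contains the extra 'remainder' entry.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical demo frame with one column left to the remainder.
demo = pd.DataFrame({"id": [0, 1, 2], "r": [5.0, 7.0, 9.0]})

ct = ColumnTransformer(
    [("std", StandardScaler(), ["r"])],
    remainder="passthrough",
)
ct.fit(demo)

# Names from the constructor argument: only what was passed in.
input_names = [name for name, _, _ in ct.transformers]
# Names from the fitted attribute: includes the remainder entry.
fitted_names = [name for name, _, _ in ct.transformers_]
```

How the remainder's columns are reported (positions or names) may vary between scikit-learn versions, so I would only rely on the entry's name here.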

Finally, I think your code could be simplified like this:

column_trans_2 = ColumnTransformer(
    [('original', NoTransformer(), ['r', 'f', 'm']),
     ('std', StandardScaler(), ['r', 'f', 'm']),
     ('boxcox', PowerTransformer(method='box-cox'), ['r', 'f', 'm']),
    ], remainder="passthrough")

transformed_2 = column_trans_2.fit_transform(df)
column_trans_2.transformers_

#[('original', NoTransformer(), ['r', 'f', 'm']),
# ('std',
#  StandardScaler(copy=True, with_mean=True, with_std=True),
#  ['r', 'f', 'm']),
# ('boxcox',
#  PowerTransformer(copy=True, method='box-cox', standardize=True),
#  ['r', 'f', 'm']),
# ('remainder', 'passthrough', [0])]

And assign the column names programmatically from the transformers list:

new_col_names = []
# 'transformers' is the list passed to the constructor: (name, transformer, columns)
for name, _, cols in column_trans_2.transformers:
    new_col_names += [name + '_' + col for col in cols]
# The passthrough remainder columns ('id' in this case) are appended on the right of
# the output array and are not part of the 'transformers' list,
# so add them to the column names manually
new_col_names += ['id']

# ['original_r', 'original_f', 'original_m', 'std_r', 'std_f', 'std_m', 'boxcox_r',
#  'boxcox_f', 'boxcox_m', 'id']


pd.DataFrame(transformed_2, columns=new_col_names)

answered Oct 30 '22 23:10

J.A.

Yes, there is a way to do this, and luckily it is built into scikit-learn. The official documentation of ColumnTransformer contains a terse but useful description of the transformer element:

transformer : {‘drop’, ‘passthrough’} or estimator

Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

This means that if you want to keep or drop specific columns in a ColumnTransformer, you can use one of the two special-cased strings in place of an estimator, like this:

column_trans = ColumnTransformer(
    [('r_std', StandardScaler(), ['r']),
     ('f_std', StandardScaler(), ['f']),
     ('m_std', StandardScaler(), ['m']),
     ('r_boxcox', PowerTransformer(method='box-cox'), ['r']),
     ('f_boxcox', PowerTransformer(method='box-cox'), ['f']),
     ('m_boxcox', PowerTransformer(method='box-cox'), ['m']),
     ('col_keep', 'passthrough', ['r', 'f', 'm']),
    ])

If you then use the ColumnTransformer, those three columns are kept in the output instead of being dropped; since 'col_keep' is listed last, they appear after the transformed columns. Alternatively, if you use 'drop' instead of 'passthrough', you can selectively drop certain columns. Combined with remainder='passthrough', this lets you drop some columns and keep all of the others. I hope you find this useful!

answered Oct 31 '22 00:10

S. Czop


Tags: python, pandas, scikit-learn, pipeline
