Preserve column order after applying sklearn.compose.ColumnTransformer

Question

I'm using Pipeline and ColumnTransformer modules from sklearn library to perform feature engineering on my dataset.

The dataset initially looks like this:

date	shop_id	item_id	item_price
02.01.2013	59	22154	999.00
03.01.2013	25	2552	899.00
05.01.2013	25	2552	899.00
06.01.2013	25	2554	1709.05
15.01.2013	25	2555	1099.00

$> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  object  
 2   shop_id         object  
 3   item_id         object  
 4   item_price      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB

Then I have the following transformations:

num_column_transformer = ColumnTransformer(
    transformers=[
        ("std_scaler", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough"
)

num_pipeline = Pipeline(
    steps=[
        ("percent_item_cnt_day_per_shop", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="shop_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_shop")
        ),
        ("percent_item_cnt_day_per_item", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="item_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_item")
        ),
        ("percent_sales_per_shop", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="shop_id",
            new_attribute_name="%_sales_per_shop")
        ),
        ("percent_sales_per_item", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="item_id",
            new_attribute_name="%_sales_per_item")
        ),
        ("num_column_transformer", num_column_transformer),
    ]
)

The first four Transformers create four new different numeric variables and the last one applies StandardScaler over all the numerical values of the dataset.

After executing it, I get the following data:

0	1	2	3	4	5	7	8
-0.092652	-0.765612	-0.173122	-0.756606	-0.379775	02.01.2013	59	22154
-0.092652	1.557684	-0.175922	1.563224	-0.394319	03.01.2013	25	2552
-0.856351	1.557684	-0.175922	1.563224	-0.394319	05.01.2013	25	2552
-0.092652	1.557684	-0.17613	1.563224	-0.396646	06.01.2013	25	2554
-0.092652	1.557684	-0.173278	1.563224	-0.380647	15.01.2013	25	2555

I'd like to have the following output:

date	shop_id	item_id	item_price	%_item_cnt_day_per_shop	%_item_cnt_day_per_item	%_sales_per_shop	%_sales_per_item
02.01.2013	59	22154	-0.092652	-0.765612	-0.173122	-0.756606	-0.379775
03.01.2013	25	2552	-0.092652	1.557684	-0.175922	1.563224	-0.394319
05.01.2013	25	2552	-0.856351	1.557684	-0.175922	1.563224	-0.394319
06.01.2013	25	2554	-0.092652	1.557684	-0.17613	1.563224	-0.396646
15.01.2013	25	2555	-0.092652	1.557684	-0.173278	1.563224	-0.380647

As you can see, columns 5, 6, 7, and 8 from the output corresponds to the first four columns in the original dataset. For example, I don't know where the item_price feature lies in the outputted table.

How could I preserve the column order and names? After that, I want to do feature engineering over categorical variables and my Transformers make use of the feature column name.
Am I using correctly the Scikit-Learn API?

amiola · Accepted Answer

There's one point to be aware of when dealing with ColumnTransformer, which is reported within the doc as follows:

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list.

That's the reason why your ColumnTransformer instance messes things up. Indeed, consider this simplified example which resembles your setting:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
               'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'], 
               'date_block_num': ['0', '0', '0', '0', '0'], 
               'shop_id': ['59', '25', '25', '25', '25'],
               'item_id': ['22514', '2252', '2252', '2254', '2255'], 
               'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))], 
    remainder='passthrough')

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

enter image description here

As you might notice, the first column in the transformed dataframe turns out to be the numeric one, i.e. the one which undergoes the scaling (and the first in the transformers list).

Conversely, here's an example of how you can bypass such issue by postponing the scaling on numeric variables after passing through all the string variables and thus ensuring the possibility of getting the columns in your desired order:

ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
])

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())

enter image description here

To complete the picture, here is an attempt to reproduce your Pipeline (though the custom transformer is for sure slightly different from yours):

from sklearn.base import BaseEstimator, TransformerMixin

class PercentOverTotalAttributeWholeAdder(BaseEstimator, TransformerMixin):

    def __init__(self, attribute_percent_name='shop_id', new_attribute_name='%_item_cnt_day_per_shop'):
    self.attribute_percent_name = attribute_percent_name
    self.new_attribute_name = new_attribute_name
    
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df[self.new_attribute_name] = df.groupby(by=self.attribute_percent_name)[self.attribute_percent_name].transform('count') / df.shape[0]
        return df

ct_pipe = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
    ], verbose_feature_names_out=False)

pipe = Pipeline([
    ('percent_item_cnt_day_per_shop', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='shop_id',
        new_attribute_name='%_item_cnt_day_per_shop')
    ),
    ('percent_item_cnt_day_per_item', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='item_id',
        new_attribute_name='%_item_cnt_day_per_item')
    ),
    ('column_trans', ct_pipe),
])

pd.DataFrame(pipe.fit_transform(df), columns=pipe[-1].get_feature_names_out())

enter image description here

As a final remark, observe that the verbose_feature_names_out=False parameter ensures that the names of the columns of the transformed dataframe do not show prefixes which refer to the different transformers in ColumnTransformer.

Preserve column order after applying sklearn.compose.ColumnTransformer

Tags:

python

pandas

scikit-learn

data-science

rober_dinero

1 Answers

amiola

Recent Activity

Donate For Us

Preserve column order after applying sklearn.compose.ColumnTransformer

Tags:

python

pandas

scikit-learn

data-science

rober_dinero

1 Answers

amiola

Related questions

Recent Activity

Donate For Us