
Preserve column order after applying sklearn.compose.ColumnTransformer

I'm using the Pipeline and ColumnTransformer classes from scikit-learn to perform feature engineering on my dataset.

The dataset initially looks like this:

date date_block_num shop_id item_id item_price
02.01.2013 0 59 22154 999.00
03.01.2013 0 25 2552 899.00
05.01.2013 0 25 2552 899.00
06.01.2013 0 25 2554 1709.05
15.01.2013 0 25 2555 1099.00

$> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   date            object 
 1   date_block_num  object  
 2   shop_id         object  
 3   item_id         object  
 4   item_price      float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB

Then I have the following transformations:

num_column_transformer = ColumnTransformer(
    transformers=[
        ("std_scaler", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough"
)

num_pipeline = Pipeline(
    steps=[
        ("percent_item_cnt_day_per_shop", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="shop_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_shop")
        ),
        ("percent_item_cnt_day_per_item", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="item_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_item")
        ),
        ("percent_sales_per_shop", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="shop_id",
            new_attribute_name="%_sales_per_shop")
        ),
        ("percent_sales_per_item", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="item_id",
            new_attribute_name="%_sales_per_item")
        ),
        ("num_column_transformer", num_column_transformer),
    ]
)

The first four transformers each add a new numeric column, and the last one applies StandardScaler to all the numeric columns of the dataset.

After executing it, I get the following data:

0 1 2 3 4 5 6 7 8
-0.092652 -0.765612 -0.173122 -0.756606 -0.379775 02.01.2013 0 59 22154
-0.092652 1.557684 -0.175922 1.563224 -0.394319 03.01.2013 0 25 2552
-0.856351 1.557684 -0.175922 1.563224 -0.394319 05.01.2013 0 25 2552
-0.092652 1.557684 -0.17613 1.563224 -0.396646 06.01.2013 0 25 2554
-0.092652 1.557684 -0.173278 1.563224 -0.380647 15.01.2013 0 25 2555

I'd like to have the following output:

date date_block_num shop_id item_id item_price %_item_cnt_day_per_shop %_item_cnt_day_per_item %_sales_per_shop %_sales_per_item
02.01.2013 0 59 22154 -0.092652 -0.765612 -0.173122 -0.756606 -0.379775
03.01.2013 0 25 2552 -0.092652 1.557684 -0.175922 1.563224 -0.394319
05.01.2013 0 25 2552 -0.856351 1.557684 -0.175922 1.563224 -0.394319
06.01.2013 0 25 2554 -0.092652 1.557684 -0.17613 1.563224 -0.396646
15.01.2013 0 25 2555 -0.092652 1.557684 -0.173278 1.563224 -0.380647

As you can see, columns 5, 6, 7, and 8 of the output correspond to the first four columns of the original dataset. However, I can't tell, for example, where the item_price feature ends up in the output table.

  1. How can I preserve the column order and names? After this step I want to do feature engineering on the categorical variables, and my transformers rely on the feature column names.
  2. Am I using the scikit-learn API correctly?
Asked by rober_dinero, Nov 15 '22 19:11


1 Answer

There's one point to be aware of when dealing with ColumnTransformer, which is stated in the documentation as follows:

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list.

That's the reason why your ColumnTransformer instance messes things up. Indeed, consider this simplified example which resembles your setting:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'],
    'date_block_num': ['0', '0', '0', '0', '0'],
    'shop_id': ['59', '25', '25', '25', '25'],
    'item_id': ['22514', '2252', '2252', '2254', '2255'],
    'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer([
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))], 
    remainder='passthrough')

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())


As you might notice, the first column in the transformed dataframe turns out to be the numeric one, i.e. the one which undergoes the scaling (and the first in the transformers list).

Conversely, you can work around this by listing a passthrough for the string columns first and the scaler for the numeric columns second, so that the output columns come out in your desired order:

ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
])

pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())


To complete the picture, here is an attempt to reproduce your Pipeline (the custom transformer is necessarily a simplified stand-in for yours):

from sklearn.base import BaseEstimator, TransformerMixin

class PercentOverTotalAttributeWholeAdder(BaseEstimator, TransformerMixin):

    def __init__(self, attribute_percent_name='shop_id', new_attribute_name='%_item_cnt_day_per_shop'):
        self.attribute_percent_name = attribute_percent_name
        self.new_attribute_name = new_attribute_name

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn
        return self

    def transform(self, X, y=None):
        # Work on a copy of the input rather than on a global DataFrame
        X = X.copy()
        # Share of rows that belong to each value of the grouping column
        X[self.new_attribute_name] = X.groupby(by=self.attribute_percent_name)[self.attribute_percent_name].transform('count') / X.shape[0]
        return X

ct_pipe = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
    ], verbose_feature_names_out=False)

pipe = Pipeline([
    ('percent_item_cnt_day_per_shop', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='shop_id',
        new_attribute_name='%_item_cnt_day_per_shop')
    ),
    ('percent_item_cnt_day_per_item', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='item_id',
        new_attribute_name='%_item_cnt_day_per_item')
    ),
    ('column_trans', ct_pipe),
])

pd.DataFrame(pipe.fit_transform(df), columns=pipe[-1].get_feature_names_out())


As a final remark, observe that the verbose_feature_names_out=False parameter ensures that the names of the columns of the transformed dataframe do not show prefixes which refer to the different transformers in ColumnTransformer.

Answered by amiola, Dec 15 '22 07:12