I'm using the `Pipeline` and `ColumnTransformer` classes from the scikit-learn library to perform feature engineering on my dataset.
The dataset initially looks like this:
date | date_block_num | shop_id | item_id | item_price |
---|---|---|---|---|
02.01.2013 | 0 | 59 | 22154 | 999.00 |
03.01.2013 | 0 | 25 | 2552 | 899.00 |
05.01.2013 | 0 | 25 | 2552 | 899.00 |
06.01.2013 | 0 | 25 | 2554 | 1709.05 |
15.01.2013 | 0 | 25 | 2555 | 1099.00 |
$> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
 #   Column          Dtype
---  ------          -----
 0   date            object
 1   date_block_num  int64
 2   shop_id         int64
 3   item_id         int64
 4   item_price      float64
 5   item_cnt_day    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 134.4+ MB
Then I have the following transformations:
num_column_transformer = ColumnTransformer(
    transformers=[
        ("std_scaler", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ],
    remainder="passthrough",
)
num_pipeline = Pipeline(
    steps=[
        ("percent_item_cnt_day_per_shop", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="shop_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_shop")
        ),
        ("percent_item_cnt_day_per_item", PercentOverTotalAttributeWholeAdder(
            attribute_percent_name="item_id",
            attribute_total_name="item_cnt_day",
            new_attribute_name="%_item_cnt_day_per_item")
        ),
        ("percent_sales_per_shop", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="shop_id",
            new_attribute_name="%_sales_per_shop")
        ),
        ("percent_sales_per_item", SalesPerAttributeOverTotalSalesAdder(
            attribute_percent_name="item_id",
            new_attribute_name="%_sales_per_item")
        ),
        ("num_column_transformer", num_column_transformer),
    ]
)
The first four transformers create four new numeric variables, and the last one applies `StandardScaler` to all the numerical values of the dataset.
After executing it, I get the following data:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
-0.092652 | -0.765612 | -0.173122 | -0.756606 | -0.379775 | 02.01.2013 | 0 | 59 | 22154 |
-0.092652 | 1.557684 | -0.175922 | 1.563224 | -0.394319 | 03.01.2013 | 0 | 25 | 2552 |
-0.856351 | 1.557684 | -0.175922 | 1.563224 | -0.394319 | 05.01.2013 | 0 | 25 | 2552 |
-0.092652 | 1.557684 | -0.17613 | 1.563224 | -0.396646 | 06.01.2013 | 0 | 25 | 2554 |
-0.092652 | 1.557684 | -0.173278 | 1.563224 | -0.380647 | 15.01.2013 | 0 | 25 | 2555 |
I'd like to have the following output:
date | date_block_num | shop_id | item_id | item_price | %_item_cnt_day_per_shop | %_item_cnt_day_per_item | %_sales_per_shop | %_sales_per_item |
---|---|---|---|---|---|---|---|---|
02.01.2013 | 0 | 59 | 22154 | -0.092652 | -0.765612 | -0.173122 | -0.756606 | -0.379775 |
03.01.2013 | 0 | 25 | 2552 | -0.092652 | 1.557684 | -0.175922 | 1.563224 | -0.394319 |
05.01.2013 | 0 | 25 | 2552 | -0.856351 | 1.557684 | -0.175922 | 1.563224 | -0.394319 |
06.01.2013 | 0 | 25 | 2554 | -0.092652 | 1.557684 | -0.17613 | 1.563224 | -0.396646 |
15.01.2013 | 0 | 25 | 2555 | -0.092652 | 1.557684 | -0.173278 | 1.563224 | -0.380647 |
As you can see, columns 5, 6, 7, and 8 of the output correspond to the first four columns of the original dataset. For example, I can't tell where the `item_price` feature lies in the output table.
There's one point to be aware of when dealing with `ColumnTransformer`, which is reported in the docs as follows:

> The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list.

That's the reason why your `ColumnTransformer` instance messes things up. Indeed, consider this simplified example, which resembles your setting:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
df = pd.DataFrame({
    'date': ['02.01.2013', '03.01.2013', '05.01.2013', '06.01.2013', '15.01.2013'],
    'date_block_num': ['0', '0', '0', '0', '0'],
    'shop_id': ['59', '25', '25', '25', '25'],
    'item_id': ['22514', '2252', '2252', '2254', '2255'],
    'item_price': [999.00, 899.00, 899.00, 1709.05, 1099.00]})

ct = ColumnTransformer(
    [('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))],
    remainder='passthrough')
pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())
As you might notice, the first column in the transformed dataframe turns out to be the numeric one, i.e. the one which undergoes the scaling (and the first in the transformers list).
Conversely, here's an example of how you can bypass this issue: postpone the scaling of the numeric variables until after all the string variables have been passed through, thus ensuring the columns come out in your desired order:
ct = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
])
pd.DataFrame(ct.fit_transform(df), columns=ct.get_feature_names_out())
To complete the picture, here is an attempt to reproduce your Pipeline (though the custom transformer is for sure slightly different from yours):
from sklearn.base import BaseEstimator, TransformerMixin
class PercentOverTotalAttributeWholeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_percent_name='shop_id', new_attribute_name='%_item_cnt_day_per_shop'):
        self.attribute_percent_name = attribute_percent_name
        self.new_attribute_name = new_attribute_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # work on the X that is passed in, not on a global dataframe
        X[self.new_attribute_name] = X.groupby(by=self.attribute_percent_name)[self.attribute_percent_name].transform('count') / X.shape[0]
        return X
ct_pipe = ColumnTransformer([
    ('pass', 'passthrough', make_column_selector(dtype_include=object)),
    ('std_scaler', StandardScaler(), make_column_selector(dtype_include=np.number))
], verbose_feature_names_out=False)

pipe = Pipeline([
    ('percent_item_cnt_day_per_shop', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='shop_id',
        new_attribute_name='%_item_cnt_day_per_shop')
    ),
    ('percent_item_cnt_day_per_item', PercentOverTotalAttributeWholeAdder(
        attribute_percent_name='item_id',
        new_attribute_name='%_item_cnt_day_per_item')
    ),
    ('column_trans', ct_pipe),
])

pd.DataFrame(pipe.fit_transform(df), columns=pipe[-1].get_feature_names_out())
As a final remark, observe that the `verbose_feature_names_out=False` parameter ensures that the column names of the transformed dataframe do not carry prefixes referring to the different transformers in the `ColumnTransformer`.