
Consistent ColumnTransformer for intersecting lists of columns

I want to use sklearn.compose.ColumnTransformer consistently (i.e. sequentially, not in parallel, so the second transformer should be executed only after the first) for intersecting lists of columns, in this way:

import numpy as np
import pandas as pd
from sklearn import compose, impute, preprocessing as p

log_transformer = p.FunctionTransformer(lambda x: np.log(x))
df = pd.DataFrame({'a': [1, 2, np.NaN, 4], 'b': [1, np.NaN, 3, 4], 'c': [1, 2, 3, 4]})
compose.ColumnTransformer(n_jobs=1,
                          transformers=[
                              ('num', impute.SimpleImputer(), ['a', 'b']),
                              ('log', log_transformer, ['b', 'c']),
                              ('scale', p.StandardScaler(), ['a', 'b', 'c'])
                          ]).fit_transform(df)

So, I want to use SimpleImputer for 'a', 'b', then log for 'b', 'c', and then StandardScaler for 'a', 'b', 'c'.

But:

  1. I get an array of shape (4, 7).
  2. I still get NaN in the a and b columns.

So, how can I use ColumnTransformer for different columns in the manner of Pipeline?

UPD:

from sklearn import pipeline

pipe_1 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=42)),
])

pipe_2 = pipeline.Pipeline(steps=[
    ('imp', impute.SimpleImputer(strategy='constant', fill_value=24)),
])

pipe_3 = pipeline.Pipeline(steps=[
    ('scl', p.StandardScaler()),
])

# in the real situation I don't know in advance exactly which columns these lists contain, so they are not static:
cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

proc = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('1', pipe_1, cols_1),
    ('2', pipe_2, cols_2),
    ('3', pipe_3, cols_3),
])
proc.fit_transform(df).T

Output:

array([[ 1.        ,  2.        , 42.        ,  4.        ],
       [ 1.        , 24.        ,  3.        ,  4.        ],
       [-1.06904497, -0.26726124,         nan,  1.33630621],
       [-1.33630621,         nan,  0.26726124,  1.06904497],
       [-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079]])

I understand why I get duplicated columns, NaNs, and unscaled values, but how can I fix this in the correct way when the columns are not static?

UPD2:

A problem may arise when the columns change their order, so I want to use a FunctionTransformer for column selection:

def select_col(X, cols=None):
    return X[cols]

ct1 = compose.make_column_transformer(
    (p.OneHotEncoder(), p.FunctionTransformer(select_col, kw_args=dict(cols=['a', 'b']))),
    remainder='passthrough'
)

ct1.fit(df)

But I get this error:

ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed

How can I fix it?

asked by konstantin_doncov


2 Answers

The intended usage of ColumnTransformer is that the different transformers are applied in parallel, not sequentially. To accomplish your desired outcome, three approaches come to mind:

First approach:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe_a = Pipeline(steps=[('imp', SimpleImputer()),
                         ('scale', StandardScaler())])
pipe_b = Pipeline(steps=[('imp', SimpleImputer()),
                         ('log', log_transformer),
                         ('scale', StandardScaler())])
pipe_c = Pipeline(steps=[('log', log_transformer),
                         ('scale', StandardScaler())])
proc = ColumnTransformer(transformers=[
    ('a', pipe_a, ['a']),
    ('b', pipe_b, ['b']),
    ('c', pipe_c, ['c'])]
)
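
Applied to the df from the question, this imputes and scales a, imputes, log-transforms, and scales b, and log-transforms and scales c, each with its own pipeline (a quick sketch, assuming the imports and df from above):

proc.fit_transform(df)  # shape (4, 3): the processed a, b, c columns, in that order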

This second one actually won't work, because the ColumnTransformer will rearrange the columns and forget the names*, so that the later transformers will fail or apply to the wrong columns. When sklearn finalizes how to pass along dataframes or feature names, this may be salvaged, or you may be able to tweak it for your specific use case now. (* ColumnTransformer does already have a get_feature_names, but the actual data passed through the pipeline doesn't have that information.)

imp_tfm = ColumnTransformer(
    transformers=[('num', SimpleImputer(), ['a', 'b'])],
    remainder='passthrough'
    )
log_tfm = ColumnTransformer(
    transformers=[('log', log_transformer, ['b', 'c'])],
    remainder='passthrough'
    )
scl_tfm = ColumnTransformer(
    transformers=[('scale', StandardScaler(), ['a', 'b', 'c'])]
    )
proc = Pipeline(steps=[
    ('imp', imp_tfm),
    ('log', log_tfm),
    ('scale', scl_tfm)]
)
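
To see concretely why this fails: the first ColumnTransformer already returns a plain NumPy array, so the string column names used by the later steps no longer exist (a minimal check, assuming the imports and df from above):

imp_tfm.fit_transform(df)                 # plain NumPy array: the names 'a', 'b', 'c' are gone
# log_tfm.fit(imp_tfm.fit_transform(df))  # fails: selecting columns by string requires a DataFrame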

Third, there may be a way to use the Pipeline slicing feature to have one "master" pipeline that you cut down for each feature... this would work mostly like the first approach, might save some coding in the case of larger pipelines, but seems a little hacky. For example, here you can:

from sklearn.base import clone

pipe_c = clone(pipe_b)[1:]
pipe_a = clone(pipe_b)
pipe_a.steps[1] = ('nolog', 'passthrough')

(Without cloning or otherwise deep-copying pipe_b, the last line would change both pipe_a and pipe_b. The slicing mechanism returns a copy, so pipe_c doesn't strictly need to be cloned, but I've left it in to feel safer. Unfortunately you can't provide a discontinuous slice, so pipe_a = pipe_b[0, 2] doesn't work, but you can set individual steps to "passthrough" as I've done above to disable them.)
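
As a quick sanity check on the slicing (assuming the pipe_b definition from the first approach), the derived pipelines end up with the expected steps:

[name for name, _ in pipe_c.steps]  # ['log', 'scale']
[name for name, _ in pipe_a.steps]  # ['imp', 'nolog', 'scale']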

answered by Ben Reiniger


We can use a little columns_name_to_index hack to convert the column names to indices, and then we can pass the dataframe through the pipeline like this:

def columns_name_to_index(arr_of_names, df):
    # map column names to their positional indices in df (skipping names that are absent)
    return [df.columns.get_loc(c) for c in arr_of_names if c in df]

cols_1 = ['a']
cols_2 = ['b']
cols_3 = ['a', 'b', 'c']

# note: ColumnTransformer expects (name, transformer, columns) triples, so each transformer needs a name
ct1 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('imp_1', impute.SimpleImputer(strategy='constant', fill_value=42), columns_name_to_index(cols_1, df)),
    ('imp_2', impute.SimpleImputer(strategy='constant', fill_value=24), columns_name_to_index(cols_2, df)),
])

ct2 = compose.ColumnTransformer(remainder='passthrough', transformers=[
    ('scl', p.StandardScaler(), columns_name_to_index(cols_3, df)),
])

pipe = pipeline.Pipeline(steps=[
    ('ct1', ct1),
    ('ct2', ct2),
])

pipe.fit_transform(df).T
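
For the example dataframe the helper just maps names to positional indices, e.g.:

columns_name_to_index(['a', 'c'], df)  # [0, 2]
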
answered by konstantin_doncov