I wonder if it is possible to use a MultiLabelBinarizer within a ColumnTransformer.
I have a toy pandas dataframe like:
df = pd.DataFrame({"id":[1,2,3],
"text": ["some text", "some other text", "yet another text"],
"label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]]})
preprocess = ColumnTransformer(
[
('vectorizer', CountVectorizer(), 'text'),
('binarizer', MultiLabelBinarizer(), ['label']),
],
remainder='drop')
This code, however, throws an exception:
~/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
714 with _print_elapsed_time(message_clsname, message):
715 if hasattr(transformer, 'fit_transform'):
--> 716 res = transformer.fit_transform(X, y, **fit_params)
717 else:
718 res = transformer.fit(X, y, **fit_params).transform(X)
TypeError: fit_transform() takes 2 positional arguments but 3 were given
With OneHotEncoder the ColumnTransformer does work.
From the MultiLabelBinarizer documentation: this transformer converts between this intuitive format and the supported multilabel format, a (samples x classes) binary matrix indicating the presence of a class label. Its optional classes parameter indicates an ordering for the class labels.
From the ColumnTransformer documentation: by default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped (the default is remainder='drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers are automatically passed through. For the columns specification, a scalar string or int should be used where the transformer expects X to be a 1d array-like (vector); otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.
The ColumnTransformer constructor takes quite a few arguments, but we're only interested in two. The first is transformers, a list of (name, transformer, columns) tuples; name is a label for the transformer, which makes setting parameters and searching for the transformer easy.
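To make the columns specification concrete, here is a minimal sketch using the toy df from the question (the transformer name 'onehot' is just illustrative): a scalar string hands the transformer a 1d Series, which is what CountVectorizer wants, while a list of names hands it a 2d DataFrame, which is what OneHotEncoder wants.

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

preprocess = ColumnTransformer(
    [
        # scalar string -> the transformer receives df['text'] as a 1d Series
        ('vectorizer', CountVectorizer(), 'text'),
        # list of names -> the transformer receives df[['id']] as a 2d DataFrame
        ('onehot', OneHotEncoder(), ['id']),
    ],
    remainder='drop')  # remainder='passthrough' would keep the untouched columns instead

preprocess.fit_transform(df)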
For input X, MultiLabelBinarizer is suited to deal with one column at a time (as each row is supposed to be a sequence of categories), while OneHotEncoder can deal with multiple columns. To make a ColumnTransformer-compatible MultiHotEncoder, you will need to iterate through all columns of X and fit/transform each column with a MultiLabelBinarizer. The following should work with pandas.DataFrame input.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer


class MultiHotEncoder(BaseEstimator, TransformerMixin):
    """Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`.
    Note that input X has to be a `pandas.DataFrame`.
    """
    def __init__(self):
        self.mlbs = list()
        self.n_columns = 0
        self.categories_ = self.classes_ = list()

    def fit(self, X: pd.DataFrame, y=None):
        # X can have multiple columns: fit one MultiLabelBinarizer per column
        for i in range(X.shape[1]):
            mlb = MultiLabelBinarizer()
            mlb.fit(X.iloc[:, i])
            self.mlbs.append(mlb)
            self.classes_.append(mlb.classes_)
            self.n_columns += 1
        return self

    def transform(self, X: pd.DataFrame):
        if self.n_columns == 0:
            raise ValueError('Please fit the transformer first.')
        if self.n_columns != X.shape[1]:
            raise ValueError(f'The fit transformer deals with {self.n_columns} columns '
                             f'while the input has {X.shape[1]}.')
        # transform each column with its own binarizer, then stack horizontally
        result = list()
        for i in range(self.n_columns):
            result.append(self.mlbs[i].transform(X.iloc[:, i]))
        result = np.concatenate(result, axis=1)
        return result
# test
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

temp = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["some text", "some other text", "yet another text"],
    "label": [["white", "cat"], ["black", "cat"], ["brown", "dog"]],
    "label2": [["w", "c"], ["b", "c"], ["b", "d"]]
})
col_transformer = ColumnTransformer([
    ('one-hot', OneHotEncoder(), ['id', 'text']),
    ('multi-hot', MultiHotEncoder(), ['label', 'label2'])
])
col_transformer.fit_transform(temp)
and you should get:
array([[1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 1.],
[0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0.],
[0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0.]])
Note how the first 3 and next 3 columns are one-hot encoded, while the following 5 and last 4 are multi-hot encoded. The categories info can be found as you normally would:
col_transformer.named_transformers_['one-hot'].categories_
>>> [array([1, 2, 3], dtype=object),
array(['some other text', 'some text', 'yet another text'], dtype=object)]
col_transformer.named_transformers_['multi-hot'].categories_
>>> [array(['black', 'brown', 'cat', 'dog', 'white'], dtype=object),
array(['b', 'c', 'd', 'w'], dtype=object)]
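The custom encoder has no get_feature_names, but if you need output column names you can assemble them by hand from the fitted categories/classes. A small sketch, assuming the column order used above (id, text, label, label2):

ohe = col_transformer.named_transformers_['one-hot']
mh = col_transformer.named_transformers_['multi-hot']
ohe_names = [f"{col}_{cat}"
             for col, cats in zip(['id', 'text'], ohe.categories_)
             for cat in cats]
mh_names = [f"{col}_{cls}"
            for col, classes in zip(['label', 'label2'], mh.classes_)
            for cls in classes]
feature_names = ohe_names + mh_names  # 15 names, one per output column above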
I wasn't particularly diligent in my testing to know exactly why the below works, but I was able to build a custom Transformer that essentially "wraps" the MultiLabelBinarizer but is also compatible with ColumnTransformer:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer


class MultiLabelBinarizerFixedTransformer(BaseEstimator, TransformerMixin):
    """
    Wraps `MultiLabelBinarizer` in a form that can work with `ColumnTransformer`
    """
    def __init__(self):
        self.feature_name = ["mlb"]
        self.mlb = MultiLabelBinarizer(sparse_output=False)

    def fit(self, X, y=None):
        self.mlb.fit(X)
        return self

    def transform(self, X):
        return self.mlb.transform(X)

    def get_feature_names(self, input_features=None):
        cats = self.mlb.classes_
        if input_features is None:
            input_features = ['x%d' % i for i in range(len(cats))]
            print(input_features)
        elif len(input_features) != len(cats):
            raise ValueError(
                "input_features should have length equal to number of "
                "features ({}), got {}".format(len(cats), len(input_features)))
        feature_names = [f"{input_features[i]}_{cats[i]}" for i in range(len(cats))]
        return np.array(feature_names, dtype=object)
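That answer does not show the wrapper in use, but a plausible way to plug it into the question's pipeline is with a scalar column name, so the wrapper receives the label lists as a 1d Series (a sketch using the question's df, not a tested recommendation):

preprocess = ColumnTransformer(
    [
        ('vectorizer', CountVectorizer(), 'text'),
        # note the scalar 'label' (not ['label']) so X is a 1d sequence of label lists
        ('binarizer', MultiLabelBinarizerFixedTransformer(), 'label'),
    ],
    remainder='drop')

preprocess.fit_transform(df)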
My hunch is that MultiLabelBinarizer uses a different set of inputs for transform() than the ColumnTransformer expects: it is designed to transform targets, so its fit_transform() signature is fit_transform(y), while ColumnTransformer calls fit_transform(X, y). That is exactly the "takes 2 positional arguments but 3 were given" error in the question, and wrapping it restores the usual fit(X, y)/transform(X) interface.
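You can check that in isolation: called on its own, MultiLabelBinarizer happily takes just the label column from the question's df, because its fit_transform expects only the target sequences:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
print(mlb.fit_transform(df["label"]))  # fit_transform(y): a single positional argument
# [[0 0 1 0 1]
#  [1 0 1 0 0]
#  [0 1 0 1 0]]
print(mlb.classes_)  # ['black' 'brown' 'cat' 'dog' 'white']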