
Create a custom sklearn TransformerMixin that transforms categorical variables consistently

This question is not a duplicate, as someone suggested. Why? Because in that example all possible values ARE KNOWN. In this example, they aren't. Further, this question - in addition to using a custom converter on unknown values - is asking specifically how to perform the transform in exactly the same way as the initial transform. Once again, I can tell I'll have to answer my own question eventually.


When creating a custom scikit-learn transformer, how can you guarantee or "force" the transform method to output only the columns it was fitted with originally?

The code below illustrates the problem. This is my example transformer.

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.linear_model import LogisticRegression

class DFTransformer(TransformerMixin):

    def fit(self, df, y=None, **fit_params):
        # nothing is learned or stored during fit
        return self

    def transform(self, df, **trans_params):
        # output columns are derived entirely from the values present in df
        self.df = df
        self.STACKER = pd.DataFrame()

        for col in self.df:
            dtype = self.df[col].dtype.name
            if dtype == 'object':
                self.STACKER = pd.concat([self.STACKER, self.get_dummies(col)], axis=1)
            elif dtype == 'int64':
                self.STACKER = pd.concat([self.STACKER, self.cut_it(col)], axis=1)

        return self.STACKER

    def get_dummies(self, name):
        return pd.get_dummies(self.df[name], prefix=name)

    def cut_it(self, name, bins=5):
        s = self.df[name].copy()
        return pd.get_dummies(pd.cut(s, bins), prefix=name)

Here's some dummy data. One of my methods uses pd.cut in an effort to bin large ranges of ints or floats. Another method uses pd.get_dummies, which turns unique values into columns.

df = pd.DataFrame({'integers': np.random.randint(2000, 20000, 30, dtype='int64'),
                   'categorical': np.random.choice(list('ABCDEFGHIJKLMNOP'), 30)},
                  columns=['integers', 'categorical'])

trans = DFTransformer()
X = trans.fit_transform(df)
y = np.random.binomial(1, 0.5, 30)
lr = LogisticRegression()
lr.fit(X, y)

X_test = pd.DataFrame({'integers': np.random.randint(2000, 60000, 30, dtype='int64'),
                   'categorical': np.random.choice(list('ABGIOPXYZ'), 30)},
                  columns=['integers', 'categorical'])
lr.predict(trans.transform(X_test))

The issue I'm having is that when I go to transform the "test" data (the data I'd like to make predictions on), it's highly likely that the conversion won't output exactly the same columns, because the test data contains different categorical values (for example, obscure values that appear once and are never seen or heard from again).
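To make the root cause concrete, here is a tiny standalone illustration (the Series below are made up just for this example) of how pd.get_dummies derives its output columns solely from the values it actually sees:

import pandas as pd

# one dummy column per value that actually appears, so different inputs
# produce different column sets
train_s = pd.Series(list('AABC'), name='categorical')
test_s = pd.Series(list('AAXZ'), name='categorical')

print(pd.get_dummies(train_s, prefix='categorical').columns.tolist())
# ['categorical_A', 'categorical_B', 'categorical_C']
print(pd.get_dummies(test_s, prefix='categorical').columns.tolist())
# ['categorical_A', 'categorical_X', 'categorical_Z']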

For example, the above code produces this error:

Traceback (most recent call last):
  File "C:/Users/myname/Downloads/SO009949884.py", line 44, in <module>
    lr.predict(trans.transform(X_test))
  File "C:\python36\lib\site-packages\sklearn\linear_model\base.py", line 324, in predict
    scores = self.decision_function(X)
  File "C:\python36\lib\site-packages\sklearn\linear_model\base.py", line 305, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 14 features per sample; expecting 20

Question: how do I go about ensuring my transform method transforms my test data the same way?

One bad solution I can think of is: transform the training data, transform the testing data, see where the columns intersect, and modify my transform function to limit output to those columns. Or, fill in blank columns for those that are missing (a rough sketch of that idea is below). This is not scalable. Surely there's a better way? I don't want to have to know what the output columns must be beforehand.
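As a rough sketch of that "fill in blank columns" idea (continuing from the code above, and not something I'm claiming scales to 150+ columns), the columns produced at fit time could be recorded and the transformed test frame reindexed against them, filling missing dummies with 0:

# sketch only: remember the training columns, then align the test output to them
trans = DFTransformer()
X = trans.fit_transform(df)
train_columns = X.columns  # columns produced from the training data

X_test_t = trans.transform(X_test)
# add missing training columns as all-zero, drop columns unique to the test data
X_test_aligned = X_test_t.reindex(columns=train_columns, fill_value=0)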

My overall goal is to convert categorical variables in a consistent way across train and test datasets. I have 150+ columns to transform!

asked Jan 18 '18 by Jarad


People also ask

What is BaseEstimator?

BaseEstimator provides, among other things, a default implementation of the get_params and set_params methods (see the source code). This is useful for making the model searchable with GridSearchCV for automated parameter tuning, and for behaving well when combined with other estimators in a Pipeline.
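As a quick, hedged illustration of that behavior (ExampleTransformer below is made up for the example and is not part of the question or answers), any keyword argument stored verbatim in __init__ becomes an inspectable, tunable parameter:

from sklearn.base import BaseEstimator, TransformerMixin

class ExampleTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.05):
        self.threshold = threshold  # stored under the same name so get_params can find it

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

print(ExampleTransformer().get_params())                         # {'threshold': 0.05}
print(ExampleTransformer().set_params(threshold=0.1).threshold)  # 0.1

Because the parameters are discoverable this way, such a transformer can be tuned inside GridSearchCV or composed in a Pipeline.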


2 Answers

I made a blog post to address this. Below is the transformer I built.

from collections import defaultdict

from sklearn.base import BaseEstimator, TransformerMixin

class CategoryGrouper(BaseEstimator, TransformerMixin):
    """A transformer for combining low count observations for categorical features.

    This transformer will preserve category values that are above a certain
    threshold, while bucketing together all the other values. This will fix issues
    where new data may have an unobserved category value that the training data
    did not have.
    """

    def __init__(self, threshold=0.05):
        """Initialize method.

        Args:
            threshold (float): The threshold to apply the bucketing when
                categorical values drop below that threshold.
        """
        self.d = defaultdict(list)
        self.threshold = threshold

    def transform(self, X, **transform_params):
        """Transforms X with new buckets.

        Args:
            X (obj): The dataset to pass to the transformer.

        Returns:
            The transformed X with grouped buckets.
        """
        X_copy = X.copy()
        for col in X_copy.columns:
            X_copy[col] = X_copy[col].apply(lambda x: x if x in self.d[col] else 'CategoryGrouperOther')
        return X_copy

    def fit(self, X, y=None, **fit_params):
        """Fits transformer over X.

        Builds a dictionary of lists where the lists are category values of the
        column key for preserving, since they meet the threshold.
        """
        df_rows = len(X.index)
        for col in X.columns:
            calc_col = X.groupby(col)[col].agg(lambda x: (len(x) * 1.0) / df_rows)
            self.d[col] = calc_col[calc_col >= self.threshold].index.tolist()
        return self

Basically, the motivation originally came from me having to handle sparse category values, but then I realized this could be applied to unknown values as well. The transformer groups sparse category values together given a threshold, and since unknown values occupy 0% of the value space, they get bucketed into the CategoryGrouperOther group.

Here's just a demonstration of the transformer:

# dfs with 100 elements in cat1 and cat2
# note how df_test has elements 'g' and 't' in the respective categories (unknown values)
df_train = pd.DataFrame({'cat1': ['a'] * 20 + ['b'] * 30 + ['c'] * 40 + ['d'] * 3 + ['e'] * 4 + ['f'] * 3,
                         'cat2': ['z'] * 25 + ['y'] * 25 + ['x'] * 25 + ['w'] * 20 +['v'] * 5})
df_test = pd.DataFrame({'cat1': ['a'] * 10 + ['b'] * 20 + ['c'] * 5 + ['d'] * 50 + ['e'] * 10 + ['g'] * 5,
                        'cat2': ['z'] * 25 + ['y'] * 55 + ['x'] * 5 + ['w'] * 5 + ['t'] * 10})

catgrouper = CategoryGrouper()
catgrouper.fit(df_train)
df_test_transformed = catgrouper.transform(df_test)

df_test_transformed

    cat1    cat2
0   a   z
1   a   z
2   a   z
3   a   z
4   a   z
5   a   z
6   a   z
7   a   z
8   a   z
9   a   z
10  b   z
11  b   z
12  b   z
13  b   z
14  b   z
15  b   z
16  b   z
17  b   z
18  b   z
19  b   z
20  b   z
21  b   z
22  b   z
23  b   z
24  b   z
25  b   y
26  b   y
27  b   y
28  b   y
29  b   y
... ... ...
70  CategoryGrouperOther    y
71  CategoryGrouperOther    y
72  CategoryGrouperOther    y
73  CategoryGrouperOther    y
74  CategoryGrouperOther    y
75  CategoryGrouperOther    y
76  CategoryGrouperOther    y
77  CategoryGrouperOther    y
78  CategoryGrouperOther    y
79  CategoryGrouperOther    y
80  CategoryGrouperOther    x
81  CategoryGrouperOther    x
82  CategoryGrouperOther    x
83  CategoryGrouperOther    x
84  CategoryGrouperOther    x
85  CategoryGrouperOther    w
86  CategoryGrouperOther    w
87  CategoryGrouperOther    w
88  CategoryGrouperOther    w
89  CategoryGrouperOther    w
90  CategoryGrouperOther    CategoryGrouperOther
91  CategoryGrouperOther    CategoryGrouperOther
92  CategoryGrouperOther    CategoryGrouperOther
93  CategoryGrouperOther    CategoryGrouperOther
94  CategoryGrouperOther    CategoryGrouperOther
95  CategoryGrouperOther    CategoryGrouperOther
96  CategoryGrouperOther    CategoryGrouperOther
97  CategoryGrouperOther    CategoryGrouperOther
98  CategoryGrouperOther    CategoryGrouperOther
99  CategoryGrouperOther    CategoryGrouperOther

It even works when I set the threshold to 0 (this will exclusively send unknown values to the 'other' group while preserving all the other category values). I would caution against setting the threshold to 0, though, because your training dataset would then not contain the 'other' category; tweak the threshold so that at least one training value is flagged into the 'other' group:

catgrouper = CategoryGrouper(threshold=0)
catgrouper.fit(df_train)
df_test_transformed = catgrouper.transform(df_test)

df_test_transformed

    cat1    cat2
0   a   z
1   a   z
2   a   z
3   a   z
4   a   z
5   a   z
6   a   z
7   a   z
8   a   z
9   a   z
10  b   z
11  b   z
12  b   z
13  b   z
14  b   z
15  b   z
16  b   z
17  b   z
18  b   z
19  b   z
20  b   z
21  b   z
22  b   z
23  b   z
24  b   z
25  b   y
26  b   y
27  b   y
28  b   y
29  b   y
... ... ...
70  d   y
71  d   y
72  d   y
73  d   y
74  d   y
75  d   y
76  d   y
77  d   y
78  d   y
79  d   y
80  d   x
81  d   x
82  d   x
83  d   x
84  d   x
85  e   w
86  e   w
87  e   w
88  e   w
89  e   w
90  e   CategoryGrouperOther
91  e   CategoryGrouperOther
92  e   CategoryGrouperOther
93  e   CategoryGrouperOther
94  e   CategoryGrouperOther
95  CategoryGrouperOther    CategoryGrouperOther
96  CategoryGrouperOther    CategoryGrouperOther
97  CategoryGrouperOther    CategoryGrouperOther
98  CategoryGrouperOther    CategoryGrouperOther
99  CategoryGrouperOther    CategoryGrouperOther
answered Sep 28 '22 by Scratch'N'Purr


And like I said, I'm answering my own question. Here's the solution I'm going with for now.

def get_datasets(df):
    # fit one transformer per dataset, then keep only the columns common to both
    trans1 = DFTransformer()
    trans2 = DFTransformer()
    train = trans1.fit_transform(df.iloc[:, :-1])
    test = trans2.fit_transform(pd.read_pickle(TEST_PICKLE_PATH))
    columns = train.columns.intersection(test.columns).tolist()
    X_train = train[columns]
    y_train = df.iloc[:, -1]
    X_test = test[columns]
    return X_train, y_train, X_test
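
For context, here is a hypothetical usage sketch with the LogisticRegression setup from the question (it assumes df holds the training features plus the target in its last column, and that TEST_PICKLE_PATH points at the pickled test frame):

X_train, y_train, X_test = get_datasets(df)

lr = LogisticRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)  # no column mismatch: both frames share the same columns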
answered Sep 28 '22 by Jarad