
How to build user friendly sklearn regressors that can handle non-numeric targets?

Goal

I am trying to build regressors that encapsulate the following process:

  1. transform the target from a non-numeric to a numeric format
  2. use numbers internally for all calculations
  3. inverse-transform the numeric values back to the original format before presenting them to the user

Ideally, the end user should be able to use the regressor without knowing the internals of the target conversions. The developer is expected to provide functions that implement the transform and inverse-transform logic.

Prototype Demo

With the help of sklearn.compose.TransformedTargetRegressor I was able to build a linear regression model that accepts timestamps as targets and internally converts them to seconds elapsed since 1970-01-01 00:00:00 (Unix epoch). The fit and predict methods already work as expected.

import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer

_check_inverse = False

# helper to convert a 2D numpy array of timestamps to a 2D array of seconds
def _to_float(timestamps):
    deltas = pd.DataFrame(timestamps).sub(pd.Timestamp(0))
    return deltas.apply(lambda s: s.dt.total_seconds()).values

# helper to convert a 2D numpy array of seconds to a 2D array of timestamps
def _to_timestamp(seconds):
    return pd.DataFrame(seconds).apply(pd.to_datetime, unit='s').values

# build transformer from helper functions
time_transformer = FunctionTransformer(
    func=_to_float,
    inverse_func=_to_timestamp,
    validate=True,
    check_inverse=_check_inverse
)

# build TransformedTargetRegressor
tt_reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=time_transformer,
    check_inverse=_check_inverse
)

Usage:

>>> import numpy as np
>>> X = np.array([[1], [2], [3]], dtype=float)
>>> y = pd.date_range(start=0, periods=3, freq='min')
>>> tt_reg = tt_reg.fit(X, y)
>>> tt_reg.predict(X)
array(['1970-01-01T00:00:00.000000000', '1970-01-01T00:01:00.000000000',
       '1970-01-01T00:02:00.000000000'], dtype='datetime64[ns]')

However, methods that use the result of predict internally, such as score (and possibly other methods of more complex sklearn regressors), fail because they can't handle the timestamps returned by _to_timestamp:

>>> tt_reg.score(X, y)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\actualpanda\.virtualenvs\SomeProject--3333Ox_\lib\site-packages\sklearn\base.py", line 435, in score
    return r2_score(y, y_pred, sample_weight=sample_weight,
  File "C:\Users\actualpanda\.virtualenvs\SomeProject--3333Ox_\lib\site-packages\sklearn\metrics\_regression.py", line 591, in r2_score
    numerator = (weight * (y_true - y_pred) ** 2).sum(axis=0,
TypeError: ufunc 'square' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

To get the score, the user must know the internals of tt_reg.regressor_ and convert the target to seconds by hand:

>>> tt_reg.regressor_.score(X, y.to_series().sub(pd.Timestamp(0)).dt.total_seconds())
1.0
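
To make the failure concrete, here is a minimal NumPy-only sketch of what happens inside r2_score and of what scoring on the numeric scale amounts to. The helpers to_seconds and r2_in_seconds are hypothetical names used for illustration only; they are not part of sklearn, and the prediction values are made up.

```python
import numpy as np

# Targets and predictions as predict() returns them: datetime64 values.
y_true = np.array(
    ["1970-01-01T00:00:00", "1970-01-01T00:01:00", "1970-01-01T00:02:00"],
    dtype="datetime64[ns]",
)
y_pred = np.array(
    ["1970-01-01T00:00:00", "1970-01-01T00:01:00", "1970-01-01T00:02:06"],
    dtype="datetime64[ns]",
)

# Subtracting datetimes yields timedelta64, which NumPy cannot square.
try:
    (y_true - y_pred) ** 2
    squaring_failed = False
except TypeError:
    squaring_failed = True


def to_seconds(timestamps):
    """Convert datetime64 values to float seconds since the Unix epoch."""
    return (timestamps - np.datetime64(0, "s")) / np.timedelta64(1, "s")


def r2_in_seconds(y_true, y_pred):
    """Coefficient of determination, computed on the numeric (seconds) scale."""
    t, p = to_seconds(y_true), to_seconds(y_pred)
    ss_res = ((t - p) ** 2).sum()
    ss_tot = ((t - t.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


r2 = r2_in_seconds(y_true, y_pred)
print(squaring_failed, r2)
```

Once both arrays are converted to floats the arithmetic is ordinary, which is why scoring tt_reg.regressor_ directly on seconds works.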

Question

Is there a feasible way to build robust, user friendly sklearn regressors that can deal with non-numeric targets and don't leak their internals?

actual_panda asked Mar 03 '23 12:03


1 Answer

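Overriding the score method might solve your problem, as mentioned in the comments. The idea is to transform y with the fitted transformer_ and then score the fitted regressor_ on the numeric scale, so the user never has to touch the internals.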

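from sklearn.compose import TransformedTargetRegressor
from sklearn.utils import check_array


class MyTransformedTargetRegressor(TransformedTargetRegressor):

    def score(self, X, y, sample_weight=None):
        # validate y; ensure_2d=False allows 1D targets
        # (the finite-value check is on by default)
        y = check_array(y, accept_sparse=False, ensure_2d=False)
        # the transformer expects 2D input, so turn a 1D target into a column
        if y.ndim == 1:
            y_2d = y.reshape(-1, 1)
        else:
            y_2d = y
        # convert the target to the numeric scale used during fit
        y_trans = self.transformer_.transform(y_2d)

        # a single-column result goes back to 1D for the inner regressor
        if y_trans.ndim == 2 and y_trans.shape[1] == 1:
            y_trans = y_trans.squeeze(axis=1)

        # score the fitted inner regressor on the transformed target
        return self.regressor_.score(X, y_trans, sample_weight=sample_weight)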
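
The reshape and squeeze steps in score are easier to follow in isolation. Below is a small NumPy-only sketch of that round trip; to_seconds_2d is a hypothetical stand-in for self.transformer_.transform, not the fitted transformer itself.

```python
import numpy as np


def to_seconds_2d(timestamps_2d):
    """Stand-in for transformer_.transform: 2D datetimes -> 2D float seconds."""
    return (timestamps_2d - np.datetime64(0, "s")) / np.timedelta64(1, "s")


# A 1D target, as a user would typically pass it to score.
y = np.array(
    ["1970-01-01T00:00:00", "1970-01-01T00:01:00", "1970-01-01T00:02:00"],
    dtype="datetime64[ns]",
)

# The transformer works on 2D arrays, so make the target a single column.
y_2d = y.reshape(-1, 1) if y.ndim == 1 else y
y_trans = to_seconds_2d(y_2d)

# A single-column result is flattened again for the inner regressor.
if y_trans.ndim == 2 and y_trans.shape[1] == 1:
    y_trans = y_trans.squeeze(axis=1)

print(y_2d.shape, y_trans.shape, y_trans)
```

The target enters as shape (3,), becomes (3, 1) for the transform, and leaves as a 1D float array again.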

Let us try it with a different regressor:

import numpy as np
from sklearn.ensemble import BaggingRegressor

tt_reg = MyTransformedTargetRegressor(
    regressor=BaggingRegressor(),
    transformer=time_transformer,
    check_inverse=_check_inverse
)

n_samples = 10000
X = np.arange(n_samples).reshape(-1, 1)
y = pd.date_range(start=0, periods=n_samples, freq='min')
tt_reg = tt_reg.fit(X, y)
tt_reg.predict(X)
print(tt_reg.score(X, y))

# 0.9999999891236799 (the exact value varies between runs because BaggingRegressor is randomized)
Venkatachalam answered Mar 05 '23 19:03