I am trying to build regressors that encapsulate the process of converting non-numeric targets to numeric values and back. Ideally, the end user should be able to use the regressor without knowing the internals of the target conversions; the developer is expected to provide functions that implement the transform and inverse-transform logic.
With the help of sklearn.compose.TransformedTargetRegressor, I was able to build a linear regression model that accepts timestamps as targets and internally converts them to seconds elapsed since 1970-01-01 00:00:00 (the Unix epoch). The fit and predict methods already work as expected.
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer

_check_inverse = False

# helper to convert a 2D numpy array of timestamps to a 2D array of seconds
def _to_float(timestamps):
    deltas = pd.DataFrame(timestamps).sub(pd.Timestamp(0))
    return deltas.apply(lambda s: s.dt.total_seconds()).values

# helper to convert a 2D numpy array of seconds to a 2D array of timestamps
def _to_timestamp(seconds):
    return pd.DataFrame(seconds).apply(pd.to_datetime, unit='s').values

# build transformer from helper functions
time_transformer = FunctionTransformer(
    func=_to_float,
    inverse_func=_to_timestamp,
    validate=True,
    check_inverse=_check_inverse
)

# build TransformedTargetRegressor
tt_reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=time_transformer,
    check_inverse=_check_inverse
)
Usage:
>>> import numpy as np
>>> X = np.array([[1], [2], [3]], dtype=float)
>>> y = pd.date_range(start=0, periods=3, freq='min')
>>> tt_reg = tt_reg.fit(X, y)
>>> tt_reg.predict(X)
array(['1970-01-01T00:00:00.000000000', '1970-01-01T00:01:00.000000000',
'1970-01-01T00:02:00.000000000'], dtype='datetime64[ns]')
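As a quick sanity check (not part of the original post), the two helpers round-trip cleanly on their own; the sample values below are purely illustrative:

```python
import pandas as pd

# same helpers as above, repeated so this snippet is self-contained
def _to_float(timestamps):
    deltas = pd.DataFrame(timestamps).sub(pd.Timestamp(0))
    return deltas.apply(lambda s: s.dt.total_seconds()).values

def _to_timestamp(seconds):
    return pd.DataFrame(seconds).apply(pd.to_datetime, unit='s').values

# three timestamps, one minute apart, as a 2D datetime64 array
y = pd.date_range(start=0, periods=3, freq='min').values.reshape(-1, 1)

secs = _to_float(y)         # seconds since the epoch, as floats
back = _to_timestamp(secs)  # converted back to datetime64[ns]
```

`secs` comes out as `[[0.0], [60.0], [120.0]]` and `back` matches `y` exactly, which is why `check_inverse` can safely be left off.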
However, methods that use the result of predict internally, such as score (and possibly other methods of more complex sklearn regressors), fail because they can't handle the output of _to_timestamp:
>>> tt_reg.score(X, y)
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\actualpanda\.virtualenvs\SomeProject--3333Ox_\lib\site-packages\sklearn\base.py", line 435, in score
return r2_score(y, y_pred, sample_weight=sample_weight,
File "C:\Users\actualpanda\.virtualenvs\SomeProject--3333Ox_\lib\site-packages\sklearn\metrics\_regression.py", line 591, in r2_score
numerator = (weight * (y_true - y_pred) ** 2).sum(axis=0,
TypeError: ufunc 'square' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
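The root cause can be reproduced outside sklearn (a minimal sketch, not from the original post): r2_score computes (y_true - y_pred) ** 2, but subtracting two datetime64 arrays yields timedelta64, which NumPy cannot square:

```python
import numpy as np

# two datetime64 arrays standing in for y_true and y_pred
y_true = np.array(['1970-01-01T00:00:00', '1970-01-01T00:01:00'],
                  dtype='datetime64[ns]')
y_pred = np.array(['1970-01-01T00:00:30', '1970-01-01T00:01:30'],
                  dtype='datetime64[ns]')

diff = y_true - y_pred  # datetime - datetime -> timedelta64[ns]
try:
    diff ** 2           # raises the same TypeError as in the traceback
    squared = True
except TypeError:
    squared = False
```

So any metric that does arithmetic on the raw predictions will break as soon as the inverse transform returns datetimes.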
In order to get the score, the user must know the internals of tt_reg.regressor_:
>>> tt_reg.regressor_.score(X, y.to_series().sub(pd.Timestamp(0)).dt.total_seconds())
1.0
Is there a feasible way to build robust, user friendly sklearn regressors that can deal with non-numeric targets and don't leak their internals?
Updating the score method might solve your problem, as mentioned in the comments.
from sklearn.utils import check_array

class MyTransformedTargetRegressor(TransformedTargetRegressor):
    def score(self, X, y):
        # validate y and make it 2D so the transformer accepts it
        y = check_array(y, accept_sparse=False, force_all_finite=True,
                        ensure_2d=False)
        if y.ndim == 1:
            y_2d = y.reshape(-1, 1)
        else:
            y_2d = y
        # score in the transformed (numeric) space
        y_trans = self.transformer_.transform(y_2d)
        if y_trans.ndim == 2 and y_trans.shape[1] == 1:
            y_trans = y_trans.squeeze(axis=1)
        return self.regressor_.score(X, y_trans)
Let us try with a different regressor:

from sklearn.ensemble import BaggingRegressor

tt_reg = MyTransformedTargetRegressor(
    regressor=BaggingRegressor(),
    transformer=time_transformer,
    check_inverse=_check_inverse
)

import numpy as np

n_samples = 10000
X = np.arange(n_samples).reshape(-1, 1)
y = pd.date_range(start=0, periods=n_samples, freq='min')

tt_reg = tt_reg.fit(X, y)
tt_reg.predict(X)
print(tt_reg.score(X, y))
# 0.9999999891236799