scikit-learn custom transformer / pipeline that changes X and Y

I have a set of N data points X = {x_1, ..., x_n} and a set of N target values / classes Y = {y_1, ..., y_n}.

The feature vector for a given y_i is constructed taking into account a "window" (for lack of a better term) of data points, e.g. I might want to stack "the last 4 data points", i.e. x_{i-4}, x_{i-3}, x_{i-2}, x_{i-1}, for prediction of y_i.

Obviously, for a window size of 4 such a feature vector cannot be constructed for the first four target values, and I would like to simply drop them. Likewise, the last data point x_n goes unused.

This would not be a problem, except that I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y vector.

Is there a way to do this that I am unaware of, or am I stuck doing this as preprocessing outside of the pipeline? (Which would mean I could not use GridSearchCV to find the optimal window size and shift.)

I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think that what I want to do is not supported in scikit-learn, but I wanted to make sure.
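For reference, the windowing step described above can be done outside the pipeline with a small helper. This is only a sketch; `make_windows` is a hypothetical name, and it assumes X is an array of shape (n, n_features) and Y an array of length n:

```python
import numpy as np

def make_windows(X, y, window=4):
    """Stack the previous `window` data points as the feature vector
    for y[i]; drop the first `window` targets (no full window exists
    for them) and leave the last data point unused."""
    X = np.asarray(X)
    y = np.asarray(y)
    Xw = np.array([X[i - window:i].ravel() for i in range(window, len(X))])
    yw = y[window:]
    return Xw, yw

Xw, yw = make_windows(np.arange(10).reshape(-1, 1), np.arange(10), window=4)
# Xw[0] stacks x_0..x_3 and is paired with y_4
```

Note that both X and Y shrink by the same amount, which is exactly the coupled change a standard sklearn transformer cannot express.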

Matt M. asked Jan 11 '16 17:01

2 Answers

You are correct, you cannot adjust your target within a sklearn Pipeline. That doesn't mean that you cannot do a grid search, but it does mean that you may have to go about it in a bit more of a manual fashion. I would recommend writing a function to do your transformations and filtering on y, and then manually looping through a tuning grid created via ParameterGrid. If this doesn't make sense to you, edit your post with the code you have for further assistance.
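A minimal sketch of that manual loop, assuming a `make_windows` helper (a hypothetical name) that builds the windowed X and trimmed y, and a simple hold-out split for scoring:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X = np.arange(50, dtype=float).reshape(-1, 1)
y = np.arange(50, dtype=float)

def make_windows(X, y, window):
    # Stack the previous `window` points per target; trim y to match.
    Xw = np.array([X[i - window:i].ravel() for i in range(window, len(X))])
    return Xw, y[window:]

best = None
for params in ParameterGrid({"window": [2, 3, 4], "alpha": [0.1, 1.0]}):
    Xw, yw = make_windows(X, y, params["window"])
    split = int(0.8 * len(Xw))  # simple hold-out split
    model = Ridge(alpha=params["alpha"]).fit(Xw[:split], yw[:split])
    score = mean_squared_error(yw[split:], model.predict(Xw[split:]))
    if best is None or score < best[0]:
        best = (score, params)
```

Because the window size changes the number of samples, each grid point rebuilds the dataset before fitting, which is precisely what GridSearchCV cannot do for you.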

David answered Sep 28 '22 02:09


I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.

I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. Then, the transform method passes on a tuple (X, self.y_stored) which is expected by the next estimator. This means I have to write wrapper estimators and it's very ugly, but it works!

Something like this:


class MyWrapperEstimator(RealEstimator):
    def fit(self, X, y=None):
        # Unpack the (X, y) tuple passed on by the previous transformer.
        if isinstance(X, tuple):
            X, y = X
        return super().fit(X=X, y=y)
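For completeness, here is a sketch of the companion transformer that stores y at fit time and emits an (X, y) tuple from transform. The class name `YPassingWindower` is made up for illustration, and this inherits the approach's ugliness: the stored y only matches X when transform is called on the same data that was fit:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class YPassingWindower(BaseEstimator, TransformerMixin):
    """Stashes y during fit, then returns (windowed X, trimmed y)
    from transform so a wrapper estimator downstream can unpack it."""

    def __init__(self, window=4):
        self.window = window

    def fit(self, X, y=None):
        self.y_stored_ = y  # dirty trick: keep y as instance state
        return self

    def transform(self, X):
        Xw = np.array([X[i - self.window:i].ravel()
                       for i in range(self.window, len(X))])
        yw = None if self.y_stored_ is None else self.y_stored_[self.window:]
        return Xw, yw
```

Since TransformerMixin's fit_transform calls fit and then transform, the tuple comes out with X and y shrunk consistently.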
Petio Petrov answered Sep 28 '22 04:09