I have a set of n data points X = {x1, ..., xn} and a set of n target values / classes Y = {y1, ..., yn}.
The feature vector for a given yi is constructed taking into account a "window" (for lack of a better term) of data points, e.g. I might want to stack "the last 4 data points", i.e. xi-4, xi-3, xi-2, xi-1 for prediction of yi.
Obviously for a window size of 4 such a feature vector cannot be constructed for the first three target values and I would like to simply drop them. Likewise for the last data point xn.
This would not be a problem, except I want this to take place as part of a sklearn pipeline. So far I have successfully written a few custom transformers for other tasks, but those cannot (as far as I know) change the Y matrix.
Is there a way to do this that I am unaware of, or am I stuck doing this as preprocessing outside of the pipeline? (Which means I would not be able to use GridSearchCV to find the optimal window size and shift.)
I have tried searching for this, but all I came up with was this question, which deals with removing samples from the X matrix. The accepted answer there makes me think that what I want to do is not supported in scikit-learn, but I wanted to make sure.
You are correct: you cannot adjust your target within a sklearn Pipeline. That doesn't mean that you cannot do a grid search, but it does mean that you may have to go about it in a more manual fashion. I would recommend writing a function to do your transformations and filtering on y and then manually looping through a tuning grid created via ParameterGrid. If this doesn't make sense to you, edit your post with the code you have for further assistance.
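A minimal sketch of that manual loop, under some assumptions: make_windowed is a hypothetical helper (not part of scikit-learn) that stacks the `window` points strictly before each target and drops the targets that have no full window; Ridge, the toy data and the grid values are placeholders for your own estimator and parameters.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid, TimeSeriesSplit, cross_val_score


def make_windowed(X, y, window=4):
    """Hypothetical helper: build windowed features and trim y to match."""
    X = np.asarray(X)
    y = np.asarray(y)
    # Row j holds x[i-window], ..., x[i-1] for the target at index i = j + window
    rows = [X[i - window:i].ravel() for i in range(window, len(X))]
    return np.array(rows), y[window:]


# Toy data (an assumption for illustration): y depends on the previous x
rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = np.roll(X, 1) + 0.1 * rng.normal(size=200)

# The window size is tuned together with the estimator's own parameter
grid = ParameterGrid({"window": [2, 4, 8], "alpha": [0.1, 1.0]})

results = []
for params in grid:
    # Apply the X/y transformation outside of any Pipeline
    Xw, yw = make_windowed(X, y, window=params["window"])
    model = Ridge(alpha=params["alpha"])
    score = cross_val_score(model, Xw, yw, cv=TimeSeriesSplit(n_splits=5)).mean()
    results.append((score, params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, best_score)
```

Keep in mind that each window size leaves a slightly different number of samples, so the scores are not computed on exactly the same rows.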
I am struggling with a similar issue and find it unfortunate that you cannot pass on the y-values between transformers. That being said, I bypassed the issue in a bit of a dirty way.
I am storing the y-values as an instance attribute of the transformers. That way I can access them in the transform method when the pipeline calls fit_transform. The transform method then returns a tuple (X, self.y_stored), which the next estimator is expected to unpack. This means I have to write wrapper estimators, and it's very ugly, but it works!
Something like this:
    class MyWrapperEstimator(RealEstimator):
        def fit(self, X, y=None):
            # Unpack the (X, y) tuple emitted by the previous transformer
            if isinstance(X, tuple):
                X, y = X
            super().fit(X=X, y=y)
            return self
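For completeness, here is a rough end-to-end sketch of how the two halves of this trick might fit together. The names WindowTransformer, TupleLinearRegression and y_stored_ are made up for illustration, LinearRegression merely stands in for RealEstimator, and the window is assumed to hold the points strictly before each target. Treat it as one possible arrangement, not a supported scikit-learn pattern.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class WindowTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: stacks the previous `window` rows of X."""

    def __init__(self, window=4):
        self.window = window

    def fit(self, X, y=None):
        # Remember the trimmed targets so transform can hand them on
        self.y_stored_ = None if y is None else np.asarray(y)[self.window:]
        return self

    def transform(self, X):
        X = np.asarray(X)
        rows = [X[i - self.window:i].ravel() for i in range(self.window, len(X))]
        # Return a tuple instead of a plain array; the next step must unpack it
        return np.array(rows), self.y_stored_


class TupleLinearRegression(LinearRegression):
    """Hypothetical wrapper that accepts the (X, y) tuple from the transformer."""

    def fit(self, X, y=None):
        if isinstance(X, tuple):
            X, y = X  # use the trimmed y, not the one the pipeline passed in
        return super().fit(X, y)

    def predict(self, X):
        if isinstance(X, tuple):
            X, _ = X  # the stored y is irrelevant at prediction time
        return super().predict(X)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 2.0 * np.roll(X[:, 0], 1)  # toy target: y[i] = 2 * x[i-1]

pipe = Pipeline([
    ("window", WindowTransformer(window=4)),
    ("model", TupleLinearRegression()),
])
pipe.fit(X, y)
preds = pipe.predict(X)
print(preds.shape)
```

Note that predict returns fewer rows than were passed in, so anything that compares the predictions with the original y (a default scorer, for example) would need the same trimming applied.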