
sklearn: Have an estimator that filters samples

I'm trying to implement my own Imputer. Under certain conditions, I would like to filter out some of the training samples (those I deem low quality).

However, I can't find a way to do it, for three reasons. The transform method returns only X and not y. y itself is a numpy array, which to the best of my knowledge I can't filter in place. And when I use GridSearchCV, the y my transform method receives is None.

Just to clarify: I'm perfectly clear on how to filter arrays. I can't find a way to fit sample filtering on the y vector into the current API.

I really want to do this from a BaseEstimator implementation so that I can use it with GridSearchCV (it has a few parameters). Am I missing a different way to achieve sample filtering (not through BaseEstimator, but still GridSearchCV compliant)? Is there some way around the current API?

Korem asked Jul 22 '14 19:07



1 Answer

The scikit-learn transformer API is designed to change the features of the data (their nature and possibly their number/dimension), not the number of samples. Any transformer that drops or adds samples is, as of the existing versions of scikit-learn, not compliant with the API (support may be added in the future if deemed important).

Given this, it looks like you will have to work around the standard scikit-learn API.
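One possible workaround, offered only as a sketch and not as something scikit-learn provides: wrap the final estimator in a meta-estimator that filters X and y together inside fit, and leaves prediction-time data untouched. The class name `SampleFilteringClassifier`, its `max_nan_fraction` parameter, the "too many missing values" quality rule, and the use of `np.nan_to_num` as a stand-in for real imputation are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


class SampleFilteringClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical meta-estimator: drops low-quality rows during fit only."""

    def __init__(self, estimator=None, max_nan_fraction=0.5):
        self.estimator = estimator
        self.max_nan_fraction = max_nan_fraction

    def _keep_mask(self, X):
        # Assumed quality rule: keep rows whose share of NaNs is small enough.
        return np.isnan(X).mean(axis=1) <= self.max_nan_fraction

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        mask = self._keep_mask(X)
        self.n_samples_kept_ = int(mask.sum())
        # X and y are filtered with the same mask, so they stay aligned.
        self.estimator_ = clone(self.estimator)
        # nan_to_num is a crude placeholder for the actual imputation step.
        self.estimator_.fit(np.nan_to_num(X[mask]), y[mask])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # No rows are dropped here: every sample still gets a prediction.
        X = np.asarray(X, dtype=float)
        return self.estimator_.predict(np.nan_to_num(X))


# Toy data: 60 samples, the first 10 rows are entirely missing.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
X[:10] = np.nan

clf = SampleFilteringClassifier(LogisticRegression(), max_nan_fraction=0.5)
clf.fit(X, y)

# The filtering threshold can be tuned like any other hyperparameter.
search = GridSearchCV(
    SampleFilteringClassifier(LogisticRegression()),
    param_grid={"max_nan_fraction": [0.25, 1.0]},
    cv=3,
)
search.fit(X, y)
```

Because the filtering happens inside fit, GridSearchCV only ever sees an ordinary estimator: the number of samples it passes in and the number of predictions it gets back are unchanged, and only the training rows used internally differ.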

eickenberg answered Sep 29 '22 00:09
