
Customized TransformerMixin with data labels in sklearn

I'm working on a small project where I'm trying to apply SMOTE (Synthetic Minority Over-sampling Technique) because my data is imbalanced.

I created a custom TransformerMixin that wraps SMOTE:

class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X)) # (57, 28)   <class 'numpy.ndarray'>
        print(len(y), ' ', type(y))  #    57      <class 'list'>
        smote = SMOTE(kind='regular', n_jobs=-1)
        X, y = smote.fit_sample(X, y)

        return X

    def transform(self, X):
        return X

model = Pipeline([
        ('posFeat1', featureVECTOR()),
        ('sca1', StandardScaler()),
        ('smote', smote()),
        ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state = 38, tol = None))
    ])
model.fit(train_df, train_df['label'].values.tolist())
predicted = model.predict(test_df)

I put the SMOTE call in fit() because I don't want it to be applied to the test data.

Unfortunately, I got this error:

     model.fit(train_df, train_df['label'].values.tolist())
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 248, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 213, in _fit
    **fit_params_steps[name])
  File "C:\Python35\lib\site-packages\sklearn\externals\joblib\memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "C:\Python35\lib\site-packages\sklearn\pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Python35\lib\site-packages\sklearn\base.py", line 520, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
AttributeError: 'numpy.ndarray' object has no attribute 'transform'
Minions asked Apr 11 '18 09:04



1 Answer

The fit() method should return self, not the transformed values. If you need the resampling only for the training data and not the test data, then implement the fit_transform() method.

class smote(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(X.shape, ' ', type(X)) # (57, 28)   <class 'numpy.ndarray'>
        print(len(y), ' ', type(y))  #    57      <class 'list'>
        self.smote = SMOTE(kind='regular', n_jobs=-1).fit(X, y)

        return self

    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.smote.sample(X, y)

    def transform(self, X):
        return X

Explanation: On the training data (i.e. when pipeline.fit() is called), Pipeline first tries to call fit_transform() on each intermediate step. If that method is not found, it calls fit() and transform() separately.

On the test data, only transform() is called on each intermediate step, so your supplied test data is left unchanged here.
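The call order described above can be observed with a small sketch. CallLogger and the CALLS list are hypothetical names introduced only for illustration, and LogisticRegression is just a stand-in for the final estimator:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Module-level log (hypothetical helper) recording which methods the pipeline calls.
CALLS = []


class CallLogger(BaseEstimator, TransformerMixin):
    """Pass-through transformer that only records the methods called on it."""

    def fit(self, X, y=None):
        CALLS.append('fit')
        return self  # fit() must return the estimator itself

    def fit_transform(self, X, y=None, **fit_params):
        CALLS.append('fit_transform')
        return X

    def transform(self, X):
        CALLS.append('transform')
        return X


X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5], [2.5]])

model = Pipeline([
    ('log', CallLogger()),
    ('clf', LogisticRegression()),
])

model.fit(X_train, y_train)   # training: fit_transform() on the intermediate step
model.predict(X_test)         # prediction: only transform() on the intermediate step

print(CALLS)  # ['fit_transform', 'transform']
```

Because fit_transform() is defined, fit() is never called on its own during training, which is why training-only logic can live in fit_transform().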

Update: The above code will still throw an error. When you oversample the supplied data, the number of samples in both X and y changes, but the scikit-learn pipeline only passes the transformed X on to the next step; it does not change y. So even if I correct the above error, you will get an error about the number of samples not matching the number of labels. And if by chance the number of generated samples equals the original number, the y values still will not correspond to the new samples.
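A minimal sketch of that mismatch, without SMOTE itself: DuplicateRows is a hypothetical stand-in for an oversampler that simply doubles the rows of X, and LogisticRegression is an arbitrary final estimator. The pipeline still hands the original y to the classifier:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class DuplicateRows(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for an oversampler: returns X with every row twice."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # X grows from n to 2n rows, but there is no way to return a new y here.
        return np.vstack([X, X])


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = Pipeline([
    ('dup', DuplicateRows()),
    ('clf', LogisticRegression()),
])

try:
    model.fit(X, y)  # classifier receives 8 rows of X but only 4 labels
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

On the scikit-learn versions I am aware of, the classifier rejects this with a ValueError about inconsistent numbers of samples; the exact wording may differ between versions.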

Working solution: Silly me.

You can just use the Pipeline from the imblearn package in place of the scikit-learn Pipeline. It automatically re-samples when fit() is called on the pipeline, and does not re-sample the test data (when transform() or predict() is called).

Actually, I knew that imblearn's Pipeline handles the sample() method, but I was thrown off when you implemented a custom class and said that the test data must not change. It did not occur to me that this is the default behaviour.

Just replace

from sklearn.pipeline import Pipeline

with

from imblearn.pipeline import Pipeline

and you are all set. There is no need to make a custom class as you did. Just use the original SMOTE class. Something like:

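from imblearn.over_sampling import SMOTE
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

random_state = 38
model = Pipeline([
    ('posFeat1', featureVECTOR()),   # your own feature extractor from the question
    ('sca1', StandardScaler()),
    # Original SMOTE class from imblearn, no custom wrapper
    ('smote', SMOTE(random_state=random_state)),
    ('classification', SGDClassifier(loss='hinge', max_iter=1, random_state=random_state, tol=None))
])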
Vivek Kumar answered Nov 19 '22 21:11
