
How to perform SMOTE with cross validation in sklearn in python

I have a highly imbalanced dataset and would like to perform SMOTE to balance it and cross-validation to measure the accuracy. However, most of the existing tutorials apply SMOTE to only a single training/testing split.

Therefore, I would like to know the correct procedure for performing SMOTE with cross-validation.

My current code is as follows. However, as mentioned above, it only uses a single train/test split.

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
clf_rf = RandomForestClassifier(n_estimators=25, random_state=12)
clf_rf.fit(X_train_res, y_train_res)

I am happy to provide more details if needed.

Asked by EmJ on Apr 09 '19.


People also ask

How do I use sklearn cross-validation?

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset:

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1, random_state=42)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores
array([0.96..., 1.

Can we use SMOTE with undersampling?

Combination of oversampling and undersampling techniques: SMOTE is one of the best-known oversampling techniques and is very effective at handling class imbalance. The idea is to combine SMOTE with an undersampling technique (ENN, Tomek links) to increase its effectiveness on imbalanced classes.
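
As a minimal sketch (not from the original post), imbalanced-learn ships combined samplers for exactly this, SMOTEENN and SMOTETomek; the synthetic dataset below is only there to make the snippet self-contained:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

# Synthetic imbalanced data, purely for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print('Original class counts:', Counter(y))

# SMOTE oversampling followed by Edited Nearest Neighbours cleaning
X_enn, y_enn = SMOTEENN(random_state=0).fit_resample(X, y)
print('SMOTE + ENN:', Counter(y_enn))

# SMOTE oversampling followed by Tomek-link removal
X_tomek, y_tomek = SMOTETomek(random_state=0).fit_resample(X, y)
print('SMOTE + Tomek:', Counter(y_tomek))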

How do you handle imbalanced data in classification problems?

When dealing with imbalanced data sets, there are three common techniques to balance the data: under-sampling the majority class, over-sampling the minority class, or a combination of the two (see the sketch below).
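
A rough sketch of the first two strategies using imbalanced-learn's samplers (again with a throwaway synthetic dataset); the combined strategy is shown in the SMOTEENN/SMOTETomek snippet above:

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 1. Under-sample the majority class
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)

# 2. Over-sample the minority class, either by duplication or with SMOTE
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)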


1 Answer

You need to perform SMOTE within each fold. Accordingly, you need to avoid train_test_split in favour of KFold:

from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]  # Based on your code, you might need a ravel call here, but I would look into how you're generating your y
    X_test = X[test_index]
    y_test = y[test_index]  # See comment on ravel and y_train
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # Choose a model here
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')

You can also, for example, append the scores to lists defined outside the loop and average them afterwards, as in the sketch below.
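
A minimal sketch of that idea, reusing the names from the loop above (accuracy_scores and f1_scores are new, purely illustrative variables):

accuracy_scores = []
f1_scores = []

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    # ... same resampling, fitting and predicting as above ...
    accuracy_scores.append(model.score(X_test, y_test))
    f1_scores.append(f1_score(y_test, y_pred))

print(f'Mean accuracy: {sum(accuracy_scores) / len(accuracy_scores):.3f}')
print(f'Mean f1-score: {sum(f1_scores) / len(f1_scores):.3f}')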

Answered by gmds on Nov 03 '22.