
Oversampling: SMOTE for binary and categorical data in Python

I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical and continuous data. Is there a way to apply SMOTE to binary and categorical data?

TTZ asked Dec 05 '17 14:12



3 Answers

According to the imbalanced-learn documentation, this is now possible with SMOTENC. SMOTE-NC can handle a mix of categorical and continuous features.

Here is the code from the documentation:

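from imblearn.over_sampling import SMOTENC

# Columns 0 and 2 of X hold the categorical features; the rest are continuous.
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
# X is the feature matrix and y the class labels of the imbalanced dataset.
X_resampled, y_resampled = smote_nc.fit_resample(X, y)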
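
For intuition about how a mixed-type oversampler can build a new row, here is a simplified, hand-written sketch of the SMOTE-NC idea described in the original SMOTE paper: continuous features are interpolated between a minority sample and one of its neighbours, while each categorical feature takes the most frequent value among the neighbours. This is not imblearn's implementation. The function name synthesize and the toy rows are hypothetical, and the nearest-neighbour search (including its penalty for differing categories) is skipped, so the neighbours are simply assumed to be given.

```python
import random
from collections import Counter


def synthesize(sample, neighbours, categorical_idx, rng):
    """Build one synthetic row from a minority sample and its neighbours."""
    partner = rng.choice(neighbours)  # neighbour to interpolate towards
    gap = rng.random()                # position along the segment, in [0, 1)
    new_row = []
    for j, value in enumerate(sample):
        if j in categorical_idx:
            # Categorical: take the most frequent value among the neighbours.
            votes = Counter(n[j] for n in neighbours)
            new_row.append(votes.most_common(1)[0][0])
        else:
            # Continuous: interpolate between the sample and the partner.
            new_row.append(value + gap * (partner[j] - value))
    return new_row


# Toy minority-class rows: columns 0 and 2 are categorical, 1 and 3 continuous.
sample = ["red", 1.0, "yes", 10.0]
neighbours = [
    ["red", 2.0, "no", 12.0],
    ["blue", 3.0, "no", 14.0],
    ["red", 1.5, "yes", 11.0],
]

row = synthesize(sample, neighbours, {0, 2}, random.Random(0))
print(row)
```

The categorical columns of the synthetic row always contain values that already exist in the data, which is why no rounding step is needed afterwards.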
A.T.B answered Sep 19 '22 14:09



As of January 2018, this had not been implemented in Python. The following is a reference from the team. In fact, they are open to proposals if someone wants to implement it.

For those with an academic interest in this issue, the paper (web archive) by Chawla, Bowyer, et al. addresses this problem in section 6.1, where it describes SMOTE-NC (SMOTE-Nominal Continuous) sampling.

Update: This feature was implemented as of 21 Oct 2018, and the request is now closed.

cph_sto answered Sep 21 '22 14:09



According to the documentation, SMOTE does not yet support categorical data in Python, and it produces continuous outputs.

As a workaround, you can convert the categorical variables to integers and then apply SMOTE.

Afterwards, use np.round(X_train[categorical_variables]) to snap the resampled values back to whole-number codes, which you can then map to the respective categorical values.
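
To illustrate why the rounding step is needed, here is a minimal pure-Python sketch. It does not call imblearn: the interpolate helper is a hypothetical stand-in for the way SMOTE places a synthetic point between two minority samples, and the column layout (a 0/1 flag followed by an age) is an assumed toy example.

```python
import random


def interpolate(row_a, row_b, rng):
    """Return a point on the segment between two rows, SMOTE-style."""
    gap = rng.random()  # random position in [0, 1)
    return [a + gap * (b - a) for a, b in zip(row_a, row_b)]


rng = random.Random(0)

# Assumed toy columns: [is_smoker encoded as 0/1, age in years].
row_a = [0, 30.0]
row_b = [1, 40.0]
categorical_idx = [0]

# Interpolation treats the integer code like any other number,
# so the categorical column generally ends up fractional.
synthetic = interpolate(row_a, row_b, rng)

# Round only the categorical columns back to whole-number codes.
restored = list(synthetic)
for j in categorical_idx:
    restored[j] = int(round(restored[j]))

print(synthetic)
print(restored)
```

Rounding is easiest to justify for binary or ordinal codes. For a nominal variable with more than two levels, a value interpolated between codes 0 and 2 can round to 1, a category that neither parent row had, so treat the result with care.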

mank answered Sep 17 '22 14:09
