I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical, and continuous data. Is there a way to apply SMOTE to binary and categorical data?
As per the documentation, this is now possible with the use of SMOTENC. SMOTE-NC is capable of handling a mix of categorical and continuous features.
Here is the code from the documentation:
from imblearn.over_sampling import SMOTENC
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)
As of Jan 2018, this had not been implemented in Python. The following is a reference from the team; in fact, they are open to proposals if someone wants to implement it.
For those with an academic interest in this ongoing issue, the paper (web archive) from Chawla, Bowyer, et al. addresses this SMOTE-NC (SMOTE-Nominal Continuous) sampling problem in section 6.1.
Update: This feature has been implemented as of 21 Oct, 2018. Service request stands closed now.
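For the curious, here is a simplified sketch of the idea described in section 6.1 of the paper, as I understand it: neighbours are found with a Euclidean distance that adds a penalty (the median of the standard deviations of the continuous features in the minority class) for every categorical mismatch, continuous features are interpolated as in plain SMOTE, and categorical features take the most frequent value among the k nearest neighbours. The function name smote_nc_sample and the toy data are hypothetical, and this is not how imbalanced-learn implements it internally:

```python
import numpy as np

def smote_nc_sample(X_min, cat_idx, i, k=5, rng=None):
    """Create one synthetic sample from minority row i (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    cat_idx = np.asarray(cat_idx)
    cont_idx = np.setdiff1d(np.arange(X_min.shape[1]), cat_idx)

    # Penalty for a categorical mismatch: median of the standard
    # deviations of the continuous features in the minority class.
    med = np.median(X_min[:, cont_idx].std(axis=0))

    # Squared distance: Euclidean on the continuous columns plus med**2
    # for every categorical column whose value differs from row i.
    diff_cont = X_min[:, cont_idx] - X_min[i, cont_idx]
    mismatches = (X_min[:, cat_idx] != X_min[i, cat_idx]).sum(axis=1)
    dist2 = (diff_cont ** 2).sum(axis=1) + mismatches * med ** 2
    dist2[i] = np.inf  # exclude the sample itself
    neighbours = np.argsort(dist2)[:k]

    # Continuous features: interpolate toward one randomly chosen neighbour.
    j = rng.choice(neighbours)
    gap = rng.random()
    new = X_min[i].copy()
    new[cont_idx] = X_min[i, cont_idx] + gap * (X_min[j, cont_idx] - X_min[i, cont_idx])

    # Categorical features: most frequent value among the k neighbours.
    for c in cat_idx:
        values, counts = np.unique(X_min[neighbours, c], return_counts=True)
        new[c] = values[np.argmax(counts)]
    return new

# Toy minority class: columns 0 and 2 are categorical codes, column 1 is continuous.
rng = np.random.default_rng(0)
X_min = np.column_stack([
    rng.integers(0, 3, size=20),
    rng.normal(size=20),
    rng.integers(0, 2, size=20),
]).astype(float)

synthetic = smote_nc_sample(X_min, cat_idx=[0, 2], i=0, k=5, rng=rng)
print(synthetic)
```

The point of the sketch is that the categorical columns of the synthetic row are always existing category codes, never in-between values.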
As per the documentation, SMOTE does not support categorical data in Python yet, and it produces continuous outputs for every feature.
You can instead employ a workaround where you convert the categorical variables to integers and use SMOTE.
Then use np.round(X_train[categorical_variables])
to round the synthetic values back to valid integer codes, which can then be mapped back to the respective categorical values.