I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set from sheer number of total negatives in the unbalanced out of year test set, see second pic)
Leave it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for test and out-of-year test set)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, since false positives and false negatives are equally bad in my case. I would also like to incorporate k-fold cross-validation, and I have read that the split should happen before upsampling. a) Should I do this, and is it likely to help? b) How can I incorporate it into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)


I found some more information here that may help you improve your results: https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
As I understand it, you seem to run into the same overfitting problem when you downsample (at least with respect to the previous-year target feature). It is hard to pinpoint the cause without seeing the data, though.
Your overfitting problem might come from the number of features: some of them may only add noise. You could start from a reduced feature set and grow it gradually, for example by using recursive feature elimination (RFE) to rank the features. More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
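As a rough illustration of the RFE idea, here is a minimal sketch. The data is synthetic (`make_classification` stands in for your real features), and the choice of Logistic Regression as the ranking estimator and of 5 features to keep are assumptions for the example, not recommendations for your dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 20 features, 5 informative,
# roughly 5% positives.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=2, weights=[0.95, 0.05],
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Recursively drop the weakest feature (by coefficient size) until 5 remain.
selector = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=5, step=1)
selector.fit(X_scaled, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
print("ranking (1 = kept):", selector.ranking_)
```

You could then retrain on `X[:, selected]` and compare the out-of-year score against the full feature set.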
As for models, you mention Random Forest and XGBoost, but not any simpler ones. You could try simpler models and focus on your data engineering. If you have not tried it yet, you could:
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
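from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Define the steps of the pipeline (liblinear supports both l1 and l2 penalties)
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression(solver='liblinear'))]
pipeline = Pipeline(steps)
# Specify the hyperparameters, prefixed with the name of the pipeline step
parameters = {'log_reg__C': [1, 10, 100],
              'log_reg__penalty': ['l1', 'l2']}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)
# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)
# Fit to the training set
cv.fit(X_train, y_train)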
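The same kind of grid search can be applied to Naive Bayes. The sketch below is only an illustration: it assumes Gaussian Naive Bayes is appropriate for your (continuous) features, uses synthetic data in place of yours, and the `var_smoothing` grid is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the real data (about 10% positives).
X, y = make_classification(n_samples=3000, n_features=10, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)

# var_smoothing is the only hyperparameter GaussianNB really exposes.
param_grid = {"var_smoothing": np.logspace(-9, -1, 5)}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score each candidate by F1 on the positive class.
search = GridSearchCV(GaussianNB(), param_grid=param_grid,
                      scoring="f1", cv=cv)
search.fit(X_train, y_train)

test_f1 = f1_score(y_test, search.predict(X_test))
print("best params:", search.best_params_)
print("test F1:", round(test_f1, 3))
```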
For your example, the pipeline could look like this (I used Logistic Regression, but you can swap in another ML algorithm and adjust the parameter grid accordingly):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
# The GridSearchCV argument is called "scoring", not "score"
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="f1")
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])
# shuffle=True is required when random_state is set
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this may help you.
(edit: I added scoring = "f1" in the GridSearchCV)