Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

Question

I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.

Here is what I've tried (and the result):

Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)

Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set from sheer number of total negatives in the unbalanced out of year test set, see second pic)

Leave it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for test and out-of-year test set)

XGBoost (slightly better recall on the out-of-year test set, 0.11)

What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross validation and have read I should do this before upsampling, a) should I do this/is it likely to help and b) how can I incorporate this into a pipeline similar to this:

from imblearn.pipeline import make_pipeline, Pipeline

clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)

pipeline = make_pipeline(??)

pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)

score_dist

Jordan Delbar · Accepted Answer

Upsampling with SMOTE and SMOTEENN : I am far from being an expert with those but by upsampling your dataset you might amplify existing noise which induce overfitting. This could explain the fact that your algorithm cannot correctly classify, thus giving the results in the first graph.

I found a little bit more info here and maybe how to improve your results: https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf

When you downsample you seem to encounter the same overfitting problem as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a view on the data though.
Your overfitting problem might come from the number of features you use that could add unnecessary noise. You might try to reduce the number of features you use and gradually increase it (using a RFE model). More info here:

https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/

For the models you used, you mention Random Forest and XGBoost, but you did not mention having used simpler model. You could try simpler model and focus on you data engineering. If you have not try it yet, maybe you could:

Downsample your data
Normalize all your data with a StandardScaler

Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression

# Define steps of the pipeline
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression())]

pipeline = Pipeline(steps)

# Specify the hyperparameters
parameters = {'C':[1, 10, 100],
              'penalty':['l1', 'l2']}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, 
random_state=42)

# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training set
cv.fit(X_train, y_train)

Anyway, for your example the pipeline could be (I made it with Logistic Regression but you can change it with another ML algorithm and change the parameters grid consequently):

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}

clf = LogisticRegression(solver='lbfgs', multi_class = 'auto')
sme = SMOTEENN(smote = SMOTE(k_neighbors = 2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid = param_grid, score = "f1")

pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])

cv = StratifiedKFold(n_splits = 4, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)

I hope this may help you.

(edit: I added score = "f1" in the GridSearchCV)

Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

Tags:

python

scikit-learn

random-forest

Man

1 Answers

Jordan Delbar

Recent Activity

Donate For Us

Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

Tags:

python

scikit-learn

random-forest

Man

1 Answers

Jordan Delbar

Related questions

Recent Activity

Donate For Us