 

Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn's RandomForestClassifier to try to predict the probability of the positive class. I have data for multiple years, and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on, in addition to my test set from within the years I'm training on.

Here is what I've tried (and the results):

Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic: the predicted probability distributions for the positive and negative classes look the same, i.e., the model assigns a very low probability to most of the positive class)

Downsampling to a balanced dataset (recall is ~0.80 for the test set, but only 0.07 for the out-of-year test set, due to the sheer number of negatives in the unbalanced out-of-year test set, see second pic)

Leaving it unbalanced (weird score distribution again; precision goes up to ~0.60, while recall falls to 0.05 and 0.10 for the test and out-of-year test sets)

XGBoost (slightly better recall on the out-of-year test set, 0.11)

What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would also like to incorporate k-fold cross-validation, and I have read that the splitting should happen before upsampling (so each fold is resampled independently): a) should I do this / is it likely to help, and b) how can I incorporate it into a pipeline similar to this:

from imblearn.pipeline import make_pipeline, Pipeline
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
sm = SMOTE(random_state=1)  # SMOTE instance passed to SMOTEENN
smote_enn = SMOTEENN(smote=sm)
kf = StratifiedKFold(n_splits=5)

pipeline = make_pipeline(??)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_pred_ooy = pipeline.predict(X_test_ooy)

[Image: score distribution plots]

Asked by Man

1 Answer

  • Upsampling with SMOTE and SMOTEENN: I am far from being an expert with these, but by upsampling your dataset you might amplify existing noise, which induces overfitting. This could explain why your algorithm cannot classify correctly, giving the results in the first graph.

There is a bit more information, and possibly ways to improve your results, here: https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf

  • When you downsample, you seem to run into the same overfitting problem, as far as I can tell (at least for the previous year's target). It is hard to pin down the reason without seeing the data, though.

  • Your overfitting problem might also come from the number of features you use, which can add unnecessary noise. You could try reducing the number of features and then gradually increasing the count again (using an RFE model), as sketched after the link below. More info here:

https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
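For illustration, here is a minimal sketch of that idea using scikit-learn's RFECV; the estimator, fold count, and the variable names X_train / y_train are assumptions based on the question's code, not part of the original answer:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import StratifiedKFold

    # Recursive feature elimination with cross-validation: drops features
    # one at a time and keeps the subset with the best cross-validated F1
    selector = RFECV(
        estimator=RandomForestClassifier(n_estimators=25, random_state=1),
        step=1,                          # remove one feature per iteration
        cv=StratifiedKFold(n_splits=5),  # stratified folds for the class imbalance
        scoring='f1',
    )
    selector.fit(X_train, y_train)

    print("Optimal number of features:", selector.n_features_)
    X_train_reduced = selector.transform(X_train)  # keep only the selected features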

For the models, you mention Random Forest and XGBoost, but you did not mention having tried any simpler models. You could try simpler models and focus on your data engineering. If you have not tried it yet, maybe you could:

  • Downsample your data
  • Normalize all your data with a StandardScaler
  • Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Define the steps of the pipeline
    # (liblinear supports both the l1 and l2 penalties below)
    steps = [('scaler', StandardScaler()),
             ('log_reg', LogisticRegression(solver='liblinear'))]

    pipeline = Pipeline(steps)

    # Specify the hyperparameters; the names are prefixed with the
    # step name because they are tuned through the pipeline
    parameters = {'log_reg__C': [1, 10, 100],
                  'log_reg__penalty': ['l1', 'l2']}

    # Create train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    # Instantiate a GridSearchCV object: cv
    cv = GridSearchCV(pipeline, param_grid=parameters)

    # Fit to the training set
    cv.fit(X_train, y_train)

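Once fitted, the grid-search object can be inspected and evaluated like any scikit-learn estimator, for example:

    # Best hyperparameters found by the grid search
    print(cv.best_params_)

    # Accuracy of the best pipeline on the held-out test set
    print(cv.score(X_test, y_test))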

Anyway, for your example the pipeline could look like this (I built it with Logistic Regression, but you can swap in another ML algorithm and change the parameter grid accordingly):

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}

clf = LogisticRegression(solver='lbfgs')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='f1')

# The imblearn Pipeline resamples only during fit, so SMOTEENN is applied
# to each training fold and never to the fold used for evaluation
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')
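cross_val_score returns one F1 score per fold, which you can then summarize:

print("F1 per fold:", score)
print("Mean F1: %.3f (+/- %.3f)" % (score.mean(), score.std()))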

I hope this helps.

(edit: I added scoring="f1" to the GridSearchCV)

Answered by Jordan Delbar


