I have an imbalanced dataset for a binary classification problem. I built a Random Forest classifier and used k-fold cross-validation with 10 folds.
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)
model=RandomForestClassifier(n_estimators=50)
I got the results of the 10 folds
results = model_selection.cross_val_score(model,features,labels, cv=kfold)
print(results)
[ 0.60666667 0.60333333 0.52333333 0.73 0.75333333 0.72 0.7
0.73 0.83666667 0.88666667]
I calculated the accuracy as the mean of the fold results, along with their standard deviation:
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Accuracy: 70.900% (10.345%)
I have computed my predictions as follows
predictions = cross_val_predict(model, features,labels ,cv=10)
Since this is an imbalanced dataset, I would like to calculate the precision, recall and F1 score of each fold and average the results. How can I calculate these values in Python?
The F1 score combines precision and recall into a single metric: it is their harmonic mean, so it is high only when both are high. Because it ignores true negatives, it is usually more informative than accuracy on imbalanced data.
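To make the relationship concrete, here is a minimal sketch that computes precision, recall and F1 from confusion-matrix counts. The function name `precision_recall_f1` and the counts are hypothetical, chosen only for illustration; they do not come from the dataset in the question.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from raw counts."""
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the positive (minority) class.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Note that true negatives never enter the calculation, which is why a large majority class cannot inflate these scores the way it inflates accuracy.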
cross_val_score accepts only a single metric. If you use the cross_validate function instead, you can specify several scorers to be calculated on each fold:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
# Map a name of your choice to each scorer
scoring = {'accuracy': make_scorer(accuracy_score),
           'precision': make_scorer(precision_score),
           'recall': make_scorer(recall_score),
           'f1_score': make_scorer(f1_score)}
# random_state only has an effect (and is only accepted) when shuffle=True
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50)
results = model_selection.cross_validate(estimator=model,
                                         X=features,
                                         y=labels,
                                         cv=kfold,
                                         scoring=scoring)
After cross-validation, results is a dictionary whose keys include 'test_accuracy', 'test_precision', 'test_recall' and 'test_f1_score' (plus 'fit_time' and 'score_time'). Each of these holds the value of that metric on every fold. For each metric you can calculate the mean and standard deviation with np.mean(results['test_' + name]) and np.std(results['test_' + name]), where name is one of the metric names you specified.
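Putting it together, the following is a self-contained sketch of the per-fold averaging. It assumes a synthetic imbalanced dataset from make_classification as a stand-in for your features and labels, and it passes scikit-learn's built-in scorer names as a list of strings rather than a dictionary of make_scorer objects; the summary dictionary is just an illustrative helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# Assumption: synthetic data (roughly 80% / 20% classes) standing in for
# the real features and labels from the question.
features, labels = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Built-in scorer names; each result is stored under the key "test_<name>".
scoring = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(model, features, labels, cv=kfold, scoring=scoring)

# Mean and standard deviation of each metric across the 10 folds.
summary = {}
for name in scoring:
    fold_scores = results["test_" + name]
    summary[name] = (np.mean(fold_scores), np.std(fold_scores))
    print("%s: %.3f (+/- %.3f)" % (name, summary[name][0], summary[name][1]))
```

With a strongly imbalanced target, some folds may contain very few positive samples; swapping KFold for StratifiedKFold keeps the class ratio similar in every fold, which tends to make the per-fold precision and recall less erratic.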