I am creating a pipeline in scikit-learn,

```python
pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('classifier', BernoulliNB()),
])
```
and computing the accuracy using cross-validation:

```python
scores = cross_val_score(
    pipeline,            # steps to convert raw messages into models
    train_set,           # training data
    label_train,         # training labels
    cv=5,                # split data randomly into 5 parts: 4 for training, 1 for scoring
    scoring='accuracy',  # which scoring metric?
    n_jobs=-1,           # -1 = use all cores = faster
)
```
How can I report a confusion matrix instead of 'accuracy'?
To enjoy the benefits of cross-validation you don't have to split the data manually. scikit-learn offers two helpers for quick evaluation with cross-validation: `cross_val_score` returns a list of fold scores, and `cross_validate` additionally reports fit and score times.
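A minimal sketch contrasting the two helpers, using synthetic data from `make_classification` in place of your own dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.naive_bayes import BernoulliNB

X, y = make_classification(n_samples=200, random_state=0)
clf = BernoulliNB()

# cross_val_score: just the test scores, one per fold
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

# cross_validate: a dict that also includes timing information
results = cross_validate(clf, X, y, cv=5, scoring='accuracy')
print(sorted(results.keys()))  # ['fit_time', 'score_time', 'test_score']
```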
Here are some of the most common performance measures you can derive from the confusion matrix. Accuracy gives the overall accuracy of the model, i.e. the fraction of all samples that the classifier labeled correctly: `(TP + TN) / (TP + TN + FP + FN)`. In other words: count the number of matches and divide by the number of samples.
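As a quick check of that formula, with small made-up label vectors, the accuracy computed from the confusion-matrix cells matches scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75, same as accuracy_score(y_true, y_pred)
```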
You could use `cross_val_predict` (see the scikit-learn docs) instead of `cross_val_score`.
Instead of doing:

```python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, x, y, cv=10)
```
you can do:

```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_pred = cross_val_predict(clf, x, y, cv=10)
conf_mat = confusion_matrix(y, y_pred)
```