I'm using scikit-learn to perform a logistic regression on spam/ham data. X_train is my training data and y_train the labels ('spam' or 'ham'), and I trained my LogisticRegression this way:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
If I want to get the accuracies for a 10-fold cross-validation, I just write:
from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
I thought it was also possible to calculate the precision and recall by simply adding one parameter, this way:
precision = cross_val_score(classifier, X_train, y_train, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train, cv=10, scoring='recall')
But it results in a ValueError:
ValueError: pos_label=1 is not a valid label: array(['ham', 'spam'], dtype='|S4')
Is it related to the data (should I binarize the labels?), or did they change the cross_val_score function?
Thank you in advance!
The precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives; intuitively, it is the ability of the classifier not to label a negative sample as positive. The recall is the ratio tp / (tp + fn), where fn is the number of false negatives; intuitively, it is the ability of the classifier to find all the positive samples.
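To make those formulas concrete, here is a minimal sketch (the label vectors are made up for the example) using scikit-learn's precision_score and recall_score:
from sklearn.metrics import precision_score, recall_score
# Hypothetical true and predicted labels for one fold (1 = spam, 0 = ham)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
# precision = tp / (tp + fp) = 2 / (2 + 1)
print(precision_score(y_true, y_pred))  # 0.666...
# recall = tp / (tp + fn) = 2 / (2 + 1)
print(recall_score(y_true, y_pred))  # 0.666...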
The cross_val_score() function performs the evaluation: it takes the estimator, the dataset and the cross-validation configuration, and returns an array with the score computed on each fold. It lives in the sklearn.model_selection module, defaults to the estimator's default score (accuracy for classifiers), and is typically called with four arguments: the estimator, X, y and cv.
Can I train my model using cross_val_score? A common question is whether cross_val_score can also serve to train the final model. Unfortunately, it cannot: cross_val_score assesses a model and its parameters by fitting fresh clones of the estimator internally and discarding them, so final training remains a separate step.
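Since cross_val_score does not leave you with a fitted model, a minimal sketch of the usual workflow (assuming the same classifier, X_train and y_train as above) is to evaluate first and then fit the final model separately:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
classifier = LogisticRegression()
# Evaluation only: fits one clone of the estimator per fold, then discards them
scores = cross_val_score(classifier, X_train, y_train, cv=10)
print(scores.mean(), scores.std())
# Final training is a separate fit on the full training set
classifier.fit(X_train, y_train)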
To compute the recall and precision, the labels indeed have to be binarized, this way:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(y_train)
y_train_bin = lb.transform(y_train).ravel()  # 'ham' -> 0, 'spam' -> 1; ravel() flattens the (n, 1) column
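With the binarized labels (y_train_bin is the variable introduced in the snippet above), the two calls from the question then work as expected, with 'spam' (encoded as 1) treated as the positive class:
precision = cross_val_score(classifier, X_train, y_train_bin, cv=10, scoring='precision')
recall = cross_val_score(classifier, X_train, y_train_bin, cv=10, scoring='recall')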
To go further, I was surprised that I didn't have to binarize the labels when I wanted to calculate the accuracy:
accuracy = cross_val_score(classifier, X_train, y_train, cv=10)
That's just because the accuracy formula doesn't need to know which class is considered positive or negative: (TP + TN) / (TP + TN + FN + FP). We can see that TP and TN are exchangeable, which is not the case for recall, precision and f1.
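A small sketch makes that asymmetry concrete (the labels are made up for the example): swapping which class counts as positive leaves the accuracy untouched but changes precision and recall:
from sklearn.metrics import accuracy_score, precision_score, recall_score
y_true = ['spam', 'spam', 'ham', 'ham', 'ham']
y_pred = ['spam', 'ham', 'ham', 'ham', 'spam']
# Accuracy ignores which class is "positive": 3 correct out of 5 either way
print(accuracy_score(y_true, y_pred))  # 0.6
# Precision and recall depend on the choice of positive class
print(precision_score(y_true, y_pred, pos_label='spam'))  # 1/2 = 0.5
print(recall_score(y_true, y_pred, pos_label='spam'))  # 1/2 = 0.5
print(precision_score(y_true, y_pred, pos_label='ham'))  # 2/3 = 0.67
print(recall_score(y_true, y_pred, pos_label='ham'))  # 2/3 = 0.67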