I trained a basic FFNN on an example breast cancer dataset. For the results, the precision_recall_curve function gives data points for 416 different thresholds. My data contains 569 unique prediction values; as far as I understand the precision-recall curve, I could apply 568 different threshold values and check the resulting precision and recall.
But how do I do so? Is there a way to set the number of thresholds to test with sklearn? Or at least an explanation of how sklearn selects those thresholds?
I mean, 417 should be enough even for bigger datasets; I am just curious how they were selected.
# necessary packages
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
# load data
sk_data = load_breast_cancer(return_X_y=False)
# save data in a pandas DataFrame
data = sk_data['data']
target = sk_data['target']
target_names = sk_data['target_names']
feature_names = sk_data['feature_names']
data = pd.DataFrame(data=data, columns=feature_names)
# build ANN
model = Sequential()
model.add(Dense(64, kernel_initializer='random_uniform', input_dim=30, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(32, kernel_initializer='random_uniform', activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(1, activation='sigmoid'))
# train ANN
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
model.fit(data, target, epochs=50, batch_size=10, validation_split=0.2)
# eval
pred = model.predict(data).ravel()  # flatten the (n, 1) output to 1-D
# calculate precision-recall curve
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(target, pred)
# plot the precision-recall curve
import matplotlib.pyplot as plt
# plt.plot([0, 1], [0.5, 0.5], linestyle='--')  # optional no-skill baseline
plt.plot(recall, precision, marker='.')
# show the plot
plt.show()
len(np.unique(pred)) #569
len(thresholds) # 417
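As for the "how do I do so" part of the question: you can evaluate precision and recall at any threshold yourself by binarizing the scores. A minimal sketch, reusing pred and target from above together with sklearn's precision_score and recall_score:

from sklearn.metrics import precision_score, recall_score
import numpy as np

# sweep every unique predicted value as a candidate threshold
for t in np.unique(pred):
    y_hat = (pred >= t).astype(int)
    p = precision_score(target, y_hat, zero_division=0)
    r = recall_score(target, y_hat)
    print(f'threshold={t:.4f}  precision={p:.3f}  recall={r:.3f}')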
For background on what the threshold does: recall is the number of true positives divided by the sum of true positives and false negatives (it is the same as sensitivity), and precision is the number of true positives divided by the sum of true positives and false positives. The higher the threshold, the higher the precision but the lower the recall; lowering the threshold has the opposite effect. High precision matters when false positives are costly, so the ideal threshold setting is the one with the highest possible recall and precision together. Plotting precision and recall against the threshold makes this trade-off visible.
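A minimal sketch of that threshold plot, reusing the precision, recall, and thresholds arrays returned above (precision and recall have one extra trailing element, the (recall=0, precision=1) end point, which is dropped so the lengths match):

import matplotlib.pyplot as plt

# precision/recall have len(thresholds) + 1 entries; drop the final
# (recall=0, precision=1) point so they line up with thresholds
plt.plot(thresholds, precision[:-1], label='precision')
plt.plot(thresholds, recall[:-1], label='recall')
plt.xlabel('decision threshold')
plt.legend()
plt.show()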
Reading the source, precision_recall_curve does compute precision and recall for each unique predicted probability (here pred), but it then omits the output for all thresholds that result in full recall, apart from the very first (i.e. highest) threshold to achieve full recall. That is why fewer thresholds (417) come back than there are unique prediction values (569).
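A rough reconstruction of that selection (a sketch based on the description above, not sklearn's actual source; it reuses pred and target from the question):

import numpy as np

scores = pred.ravel()
positives = target == 1

def recall_at(t):
    # recall when every sample scored >= t is predicted positive
    return (positives & (scores >= t)).sum() / positives.sum()

# one candidate threshold per unique predicted value, ascending
candidates = np.unique(scores)

# the largest threshold that still yields full recall; every threshold
# below it also gives recall == 1 and is omitted by sklearn
cutoff = max(t for t in candidates if recall_at(t) == 1.0)

selected = candidates[candidates >= cutoff]
print(len(selected))  # should match len(thresholds)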