How does sklearn select threshold steps in precision recall curve?

Tags: python, precision, scikit-learn, precision-recall

I trained a basic FFNN on an example breast cancer dataset. For the results, the precision_recall_curve function gives data points for 417 different thresholds. My data contains 569 unique prediction values. As far as I understand the precision-recall curve, I could apply 568 different threshold values and check the resulting precision and recall.

But how do I do so? Is there a way to set the number of thresholds to test with sklearn? Or at least an explanation of how sklearn selects those thresholds?

I mean, 417 should be enough, even for bigger data sets; I am just curious how they were selected.

# necessary packages
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

# load data
sk_data = load_breast_cancer(return_X_y=False)

# save data in a pandas DataFrame
data = sk_data['data']
target = sk_data['target']
target_names = sk_data['target_names']
feature_names = sk_data['feature_names']
data = pd.DataFrame(data=data, columns=feature_names)

# build ANN
model = Sequential()
model.add(Dense(64, kernel_initializer='random_uniform', input_dim=30, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(32, kernel_initializer='random_uniform', activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(1, activation='sigmoid'))

# train ANN
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

model.fit(data, target, epochs=50, batch_size=10, validation_split=0.2)

# eval
pred = model.predict(data)

# calculate precision-recall curve
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(target, pred)

# plot the precision-recall curve
import matplotlib.pyplot as plt

# optional horizontal reference line at precision = 0.5
# plt.plot([0, 1], [0.5, 0.5], linestyle='--')
plt.plot(recall, precision, marker='.')
# show the plot
plt.show()

len(np.unique(pred))  # 569
len(thresholds)  # 417

asked Sep 24 '19 09:09

Quastiat

1 Answer

Reading the source, precision_recall_curve does compute precision and recall for each unique predicted probability (here pred). It then omits the output for every threshold that results in full recall, apart from the very first threshold that achieves full recall. This is why thresholds has fewer entries than there are unique predictions (417 versus 569 here).
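To make the truncation concrete, here is a minimal pure-Python sketch of the behaviour described above. It does not call scikit-learn and is not its actual implementation. The helper names precision_recall_at and truncated_curve and the toy data are made up for illustration, and the sketch assumes binary 0/1 labels with at least one positive. Behaviour may also differ between scikit-learn versions.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold counts as positive.

    Hypothetical helper, assumes 0/1 labels and at least one positive label.
    """
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    n_pos = sum(1 for y in y_true if y == 1)
    # With no positive predictions, precision is set to 1.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / n_pos
    return precision, recall


def truncated_curve(y_true, scores):
    """Walk the unique scores from high to low and stop at the first
    threshold that reaches full recall, as the answer describes."""
    candidates = sorted(set(scores), reverse=True)
    kept = []
    for t in candidates:
        precision, recall = precision_recall_at(y_true, scores, t)
        kept.append((t, precision, recall))
        if recall == 1.0:
            break  # lower thresholds also give full recall and are dropped
    return candidates, kept


# Toy data (made up): 8 unique scores, 4 positives.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.1, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]

candidates, kept = truncated_curve(y_true, scores)
print(len(candidates), len(kept))  # 8 5

# Evaluating a self-chosen grid of thresholds instead:
grid = [precision_recall_at(y_true, scores, t) for t in (0.25, 0.5, 0.75)]
```

In this toy example the three lowest scores (0.2, 0.1, 0.05) never appear as thresholds, because full recall is already reached at 0.35. If you want a fixed number of thresholds, one option is to evaluate precision and recall yourself on a grid, as in the last line.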

answered Sep 18 '22 19:09

Paul Brodersen
