I need to set a specific decision threshold and generate a confusion matrix. The data is in a CSV file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing
First, I received an error message: "AttributeError: predict_proba is not available when probability=False". So I used this as a correction:
svc = SVC(C=1e9, gamma=1e-07)
svc_calibrated = CalibratedClassifierCV(svc)
svc_model = svc_calibrated.fit(X_train, y_train)
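As an aside, a minimal sketch of the alternative fix, which would also have made predict_proba available here: enabling probability estimates directly on the SVC instead of wrapping it (same hyperparameters as above, just for illustration).

from sklearn.svm import SVC

# Alternative to CalibratedClassifierCV: have the SVC fit its own
# internal Platt calibration (slower to train, but exposes predict_proba)
svc_prob = SVC(C=1e9, gamma=1e-07, probability=True)
svc_model = svc_prob.fit(X_train, y_train)

Either way, predict_proba becomes available after fitting.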
I searched a lot on the internet, and I still don't quite understand how a custom threshold value is applied. It sounds pretty hard. Now I get a wrong output:
array([[   0,    0],
       [5359,   65]])
I have no idea what is wrong. I need help; I'm new to this. Thanks.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('fraud_data.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV

    svc = SVC(C=1e9, gamma=1e-07)
    svc_calibrated = CalibratedClassifierCV(svc)
    svc_model = svc_calibrated.fit(X_train, y_train)

    # set threshold as -220
    y_pred = (svc_model.predict_proba(X_test)[:, 1] >= -220)

    conf_matrix = confusion_matrix(y_pred, svc_model.predict(X_test))
    return conf_matrix

answer_four()
This function should return a confusion matrix, a 2x2 numpy array with 4 integers.
The following code produces the expected output. Besides using the confusion matrix incorrectly in the previous code, I should also have used decision_function and filtered its output with the -220 threshold: decision_function returns signed distances to the separating hyperplane, which are unbounded, so a negative cutoff like -220 makes sense there (probabilities never go below 0).
def answer_four():
    import numpy as np
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC

    # SVC without any mention of the kernel: the default is rbf
    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)

    # decision_function scores: confidence scores for the samples
    y_score = svc.decision_function(X_test)

    # set a threshold of -220: scores above it are labeled 1, the rest 0
    # (the threshold is the separation boundary between the two classes,
    # applied to the trained model's scores)
    y_score = np.where(y_score > -220, 1, 0)

    conf_matrix = confusion_matrix(y_test, y_score)
    return conf_matrix

answer_four()
#output:
array([[5320,   24],
       [  14,   66]])
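For reading these matrices: for binary labels {0, 1}, sklearn's confusion_matrix(y_true, y_pred) puts true labels on the rows and predicted labels on the columns, i.e. [[TN, FP], [FN, TP]]. A quick way to unpack it (a sketch assuming y_score, the thresholded predictions from the function above, is available at top level):

from sklearn.metrics import confusion_matrix

# ravel() flattens the 2x2 matrix in row order: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_score).ravel()
print(tn, fp, fn, tp)  # 5320 24 14 66 for the output above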
You are using the confusion matrix in a wrong way. The idea behind the confusion matrix is to give a picture of how good our predictions y_pred are compared with the ground truth y_true, usually on a test set. What you actually do here is compute a "confusion matrix" between your predictions with the custom threshold of -220 (y_pred) and some other predictions with the default threshold (the output of svc_model.predict(X_test)), which does not make any sense.
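A side note on why your first row is all zeros: predict_proba returns probabilities in [0, 1], so comparing them against -220 labels every single sample as positive, and nothing ever lands in the row for the negative class. A quick check (using the fitted svc_model from your question):

import numpy as np

proba = svc_model.predict_proba(X_test)[:, 1]  # values in [0, 1]
y_pred = (proba >= -220)                       # always True
print(np.all(y_pred))                          # True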
Your ground truth for the test set is y_test; so, to get the confusion matrix with the default threshold, you should use

confusion_matrix(y_test, svc_model.predict(X_test))
To get the confusion matrix with your custom threshold of -220, you should use
confusion_matrix(y_test, y_pred)
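Putting both together, a minimal sketch (assuming the fitted svc and svc_model from the code above are in scope):

from sklearn.metrics import confusion_matrix

# default threshold, i.e. what predict() applies internally
print(confusion_matrix(y_test, svc_model.predict(X_test)))

# custom threshold of -220 applied to the decision_function scores
y_pred = (svc.decision_function(X_test) > -220).astype(int)
print(confusion_matrix(y_test, y_pred))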
See the documentation for more details on usage (the docs are your best friend, and should always be the first place to look when you run into issues or doubts).