I have a very big data set (ds
). One of its columns is Popularity
, of type factor
('High' / ' Low').
I split the data to 70% and 30% in order to create a training set (ds_tr
) and a test set (ds_te
).
I have created the following model using a Logistic regression:
mdl <- glm(formula = popularity ~ . -url , family= "binomial", data = ds_tr )
then I created a predict
object (will do it again for ds_te
)
y_hat = predict(mdl, data = ds_tr - url , type = 'response')
I want to find the precision value which corresponds to a cutoff threshold of 0.5 and find the recall value which corresponds to a cutoff threshold of 0.5, so I did:
library(ROCR)
pred <- prediction(y_hat, ds_tr$popularity)
perf <- performance(pred, "prec", "rec")
The result is a table of many values
str(perf)
Formal class 'performance' [package "ROCR"] with 6 slots
..@ x.name : chr "Recall"
..@ y.name : chr "Precision"
..@ alpha.name : chr "Cutoff"
..@ x.values :List of 1
.. ..$ : num [1:27779] 0.00 7.71e-05 7.71e-05 1.54e-04 2.31e-04 ...
..@ y.values :List of 1
.. ..$ : num [1:27779] NaN 1 0.5 0.667 0.75 ...
..@ alpha.values:List of 1
.. ..$ : num [1:27779] Inf 0.97 0.895 0.89 0.887 ...
How do I find the specific precision and recall values corresponding to a cutoff threshold of 0.5?
Acces the slots of performance object (through the combination of @ + list)
We create a dataset with all possible values:
probab.cuts <- data.frame([email protected][[1]], [email protected][[1]], [email protected][[1]])
You can view all associated values
probab.cuts
If you want to select the requested values, it is trivial to do:
tail(probab.cuts[probab.cuts$cut > 0.5,], 1)
Manual check
tab <- table(ds_tr$popularity, y_hat > 0.5)
tab[4]/(tab[4]+tab[2]) # recall
tab[4]/(tab[4]+tab[3]) # precision
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With