I'm evaluating a number of different algorithms whose job is to predict the probability of an event occurring.
I am testing the algorithms on large-ish datasets. I measure their effectiveness using "Root Mean Squared Error" (RMSE), which is the square root of the mean of the squared errors. The error is the difference between the predicted probability (a floating point value between 0 and 1) and the actual outcome (either 0.0 or 1.0).
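For concreteness, here is a minimal Python sketch of that calculation; `probs` and `outcomes` are placeholder names for the predicted probabilities and the observed 0.0/1.0 values, not part of my actual code:

```python
# Illustrative sketch of the RMSE described above.
import numpy as np

def rmse(probs, outcomes):
    errors = np.asarray(probs, dtype=float) - np.asarray(outcomes, dtype=float)  # predicted minus actual
    return np.sqrt(np.mean(errors ** 2))  # square root of the mean squared error
```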
So I know the RMSE, and also the number of samples that the algorithm was tested on.
The problem is that sometimes the RMSE values are quite close to each other, and I need a way to determine whether the difference between them is just chance or whether it reflects a genuine difference in performance.
Ideally, for a given pair of RMSE values, I'd like to know what the probability is that one is really better than the other, so that I can use this probability as a threshold of significance.
Compute a test statistic and compare it against a statistics table (these can be found online or in statistics textbooks). Look up the critical value at the intersection of the appropriate degrees of freedom and your chosen significance level (alpha); if the test statistic exceeds that critical value, or equivalently if the p-value is less than alpha, the difference is statistically significant. By convention, a p-value below 0.05 is considered statistically significant.
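If you prefer code to a printed table, a library such as SciPy can produce the same critical value; a minimal sketch, where alpha = 0.05 is just the conventional choice and the normal approximation assumes a large sample:

```python
# Hedged sketch: looking up a two-sided critical value with SciPy instead of a printed table.
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided normal critical value, approx. 1.96
print(z_crit)  # a test statistic larger in absolute value than this is significant at alpha
```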
The MSE is an average and hence the central limit theorem applies. So testing whether two MSEs are the same is the same as testing whether two means are equal. A difficulty compared to a standard test comparing two means is that your samples are correlated -- both come from the same events. But a difference in MSE is the same as a mean of differenced squared errors (means are linear). This suggests calculating a one-sample t-test as follows:
1. Compute the prediction error e for each observation under procedure 1 and procedure 2.
2. For each observation, form the difference of squared errors d = e2^2 - e1^2.
3. Compute the t statistic t = mean(d) / (sd(d) / sqrt(n)).
4. Reject the hypothesis of equal MSE at the 5% level when |t| > 1.96.
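A minimal Python sketch of those steps, assuming `p1` and `p2` hold the predicted probabilities from procedures 1 and 2 and `y` holds the observed 0.0/1.0 outcomes (the names and the use of SciPy for the p-value are illustrative additions, not part of the recipe above):

```python
# Minimal sketch of the one-sample t-test on squared-error differences described above.
import numpy as np
from scipy import stats

def mse_difference_test(p1, p2, y):
    e1 = np.asarray(p1, dtype=float) - np.asarray(y, dtype=float)  # errors of procedure 1
    e2 = np.asarray(p2, dtype=float) - np.asarray(y, dtype=float)  # errors of procedure 2
    d = e2 ** 2 - e1 ** 2                                          # per-observation squared-error differences
    n = d.size
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))               # one-sample t statistic
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)                # two-sided p-value
    return t_stat, p_value
```

For large n, rejecting when |t| > 1.96 and rejecting when the p-value is below 0.05 are the same decision.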
The RMSE is a monotonic transformation of the MSE, so this test shouldn't give substantively different results. But be careful not to mix up MSE and RMSE when reporting the comparison.
A bigger concern should be overfitting. Make sure to compute all your MSE statistics using data that you did not use to estimate your model.
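For illustration only, here is a sketch of that holdout discipline, using synthetic data and a logistic model as stand-ins; the point is simply that the model is fit on one split and the RMSE is computed on the other:

```python
# Hedged sketch: the data and model here are placeholders, not the asker's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                            # synthetic features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(float)   # synthetic 0/1 outcomes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)        # estimate the model on training data only
probs = model.predict_proba(X_test)[:, 1]                 # predict on the held-out data
rmse_holdout = np.sqrt(np.mean((probs - y_test) ** 2))    # RMSE computed on data the model never saw
```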
You are entering into a vast and contentious area of not only computation but philosophy. Significance tests and model selection are subjects of intense disagreement between the Bayesians and the Frequentists. Triston's comment about splitting the data-set into training and verification sets would not please a Bayesian.
May I suggest that RMSE is not an appropriate score for probabilities. If the samples are independent, the proper score is the sum of the logarithms of the probabilities assigned to the actual outcomes. (If they are not independent, you have a mess on your hands.) What I am describing is scoring a "plug-in" model. Proper Bayesian modeling requires integrating over the model parameters, which is computationally extremely difficult. A Bayesian way to regulate a plug-in model is to add a penalty to the score for unlikely (large) model parameters. That's been called "weight decay."
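As an illustration, here is a minimal sketch of that log score, assuming `probs` are the probabilities assigned to the event and `outcomes` are the observed 0/1 results; the clipping is only a numerical safeguard, not part of the definition:

```python
# Sketch of the log score: sum of log probabilities assigned to the actual outcomes.
import numpy as np

def log_score(probs, outcomes, eps=1e-15):
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)  # keep log() finite
    y = np.asarray(outcomes, dtype=float)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))     # higher (less negative) is better
```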
I got started on my path of discovery reading Neural Networks for Pattern Recognition by Christopher Bishop. I used it and Practical Optimization by Gill, et al. to write software that has worked very well for me.