I'm evaluating a number of different algorithms whose job is to predict the probability of an event occurring.
I am testing the algorithms on large-ish datasets. I measure their effectiveness using "Root Mean Squared Error" (RMSE), which is the square root of the mean of the squared errors. The error is the difference between the predicted probability (a floating point value between 0 and 1) and the actual outcome (either 0.0 or 1.0).
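For concreteness, here is a minimal Python sketch of that calculation; `probs` and `outcomes` are placeholder names for the predicted probabilities and the observed 0.0/1.0 values, not part of my actual code:

```python
# Illustrative sketch of the RMSE described above.
import numpy as np

def rmse(probs, outcomes):
    errors = np.asarray(probs, dtype=float) - np.asarray(outcomes, dtype=float)  # predicted minus actual
    return np.sqrt(np.mean(errors ** 2))  # square root of the mean squared error
```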
So I know the RMSE, and also the number of samples that the algorithm was tested on.
The problem is that sometimes the RMSE values are quite close to each other, and I need a way to determine whether the difference between them is just chance or whether it reflects a genuine difference in performance.
Ideally, for a given pair of RMSE values, I'd like to know what the probability is that one is really better than the other, so that I can use this probability as a threshold of significance.
Compute a test statistic and compare it against a statistics table (these can be found online or in statistics textbooks). Look up the critical value at the intersection of the appropriate degrees of freedom and your chosen significance level (alpha); if the test statistic exceeds that critical value, or equivalently if the p-value is less than alpha, the difference is statistically significant. By convention, a p-value below 0.05 is considered statistically significant.
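If you prefer code to a printed table, a library such as SciPy can produce the same critical value; a minimal sketch, where alpha = 0.05 is just the conventional choice and the normal approximation assumes a large sample:

```python
# Hedged sketch: looking up a two-sided critical value with SciPy instead of a printed table.
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided normal critical value, approx. 1.96
print(z_crit)  # a test statistic larger in absolute value than this is significant at alpha
```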
The MSE is an average and hence the central limit theorem applies. So testing whether two MSEs are the same is the same as testing whether two means are equal. A difficulty compared to a standard test comparing two means is that your samples are correlated -- both come from the same events. But a difference in MSE is the same as a mean of differenced squared errors (means are linear). This suggests calculating a one-sample t-test as follows:
1. Compute the prediction error e for each observation under procedure 1 and procedure 2.
2. For each observation, form the difference of squared errors d = e2^2 - e1^2.
3. Compute the t statistic t = mean(d) / (sd(d) / sqrt(n)).
4. Reject the hypothesis of equal MSE at the 5% level when |t| > 1.96.
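A minimal Python sketch of those steps, assuming `p1` and `p2` hold the predicted probabilities from procedures 1 and 2 and `y` holds the observed 0.0/1.0 outcomes (the names and the use of SciPy for the p-value are illustrative additions, not part of the recipe above):

```python
# Minimal sketch of the one-sample t-test on squared-error differences described above.
import numpy as np
from scipy import stats

def mse_difference_test(p1, p2, y):
    e1 = np.asarray(p1, dtype=float) - np.asarray(y, dtype=float)  # errors of procedure 1
    e2 = np.asarray(p2, dtype=float) - np.asarray(y, dtype=float)  # errors of procedure 2
    d = e2 ** 2 - e1 ** 2                                          # per-observation squared-error differences
    n = d.size
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))               # one-sample t statistic
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)                # two-sided p-value
    return t_stat, p_value
```

For large n, rejecting when |t| > 1.96 and rejecting when the p-value is below 0.05 are the same decision.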
The RMSE is a monotonic transformation of the MSE, so this test shouldn't give substantively different results. But be careful not to mix up MSE and RMSE when reporting the comparison.
A bigger concern should be overfitting. Make sure to compute all your MSE statistics using data that you did not use to estimate your model.
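For illustration only, here is a sketch of that holdout discipline, using synthetic data and a logistic model as stand-ins; the point is simply that the model is fit on one split and the RMSE is computed on the other:

```python
# Hedged sketch: the data and model here are placeholders, not the asker's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                            # synthetic features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(float)   # synthetic 0/1 outcomes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)        # estimate the model on training data only
probs = model.predict_proba(X_test)[:, 1]                 # predict on the held-out data
rmse_holdout = np.sqrt(np.mean((probs - y_test) ** 2))    # RMSE computed on data the model never saw
```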
You are entering into a vast and contentious area of not only computation but philosophy. Significance tests and model selection are subjects of intense disagreement between the Bayesians and the Frequentists. Triston's comment about splitting the data-set into training and verification sets would not please a Bayesian.
May I suggest that RMSE is not an appropriate score for probabilities. If the samples are independent, the proper score is the sum of the logarithms of the probabilities assigned to the actual outcomes. (If they are not independent, you have a mess on your hands.) What I am describing is scoring a "plug-in" model. Proper Bayesian modeling requires integrating over the model parameters, which is computationally extremely difficult. A Bayesian way to regulate a plug-in model is to add a penalty to the score for unlikely (large) model parameters. That's been called "weight decay."
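As an illustration, here is a minimal sketch of that log score, assuming `probs` are the probabilities assigned to the event and `outcomes` are the observed 0/1 results; the clipping is only a numerical safeguard, not part of the definition:

```python
# Sketch of the log score: sum of log probabilities assigned to the actual outcomes.
import numpy as np

def log_score(probs, outcomes, eps=1e-15):
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)  # keep log() finite
    y = np.asarray(outcomes, dtype=float)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))     # higher (less negative) is better
```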
I got started on my path of discovery reading Neural Networks for Pattern Recognition by Christopher Bishop. I used it and Practical Optimization by Gill, et al. to write software that has worked very well for me.