 

Determining if the difference between two error values is significant

I'm evaluating a number of different algorithms whose job is to predict the probability of an event occurring.

I am testing the algorithms on large-ish datasets. I measure their effectiveness using Root Mean Squared Error (RMSE), which is the square root of the mean of the squared errors. The error is the difference between the predicted probability (a floating point value between 0 and 1) and the actual outcome (either 0.0 or 1.0).
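
For concreteness, here is a minimal sketch of that computation (the names are illustrative, not from my actual code):

    import math

    def rmse(predictions, outcomes):
        # Root Mean Squared Error: the square root of the mean of the squared errors.
        # predictions: predicted probabilities (floats between 0 and 1)
        # outcomes:    actual outcomes (0.0 or 1.0)
        squared_errors = [(p - y) ** 2 for p, y in zip(predictions, outcomes)]
        return math.sqrt(sum(squared_errors) / len(squared_errors))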

So I know the RMSE, and also the number of samples that the algorithm was tested on.

The problem is that sometimes the RMSE values are quite close to each other, and I need a way to determine whether the difference between them is just chance, or whether it represents an actual difference in performance.

Ideally, for a given pair of RMSE values, I'd like to know what the probability is that one is really better than the other, so that I can use this probability as a threshold of significance.

sanity asked Jan 30 '10



2 Answers

The MSE is an average, and hence the central limit theorem applies. So testing whether two MSEs are the same is the same as testing whether two means are equal. A difficulty compared to a standard test comparing two means is that your samples are correlated -- both come from the same events. But a difference in MSEs is the same as a mean of differenced squared errors (means are linear). This suggests calculating a one-sample t-test as follows:

  1. For each event x, compute the errors e1 and e2 for procedures 1 and 2.
  2. Compute the differences of squared errors (e2^2 - e1^2).
  3. Compute the mean of the differences.
  4. Compute the standard deviation of the differences.
  5. Compute a t-statistic as mean/(sd/sqrt(n)).
  6. Compare your t-statistic to a critical value or compute a p-value. For instance, reject equality at the 5% significance level if |t| > 1.96. (A sketch of these steps in code follows.)
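
A minimal sketch of those steps in Python (the function and variable names are my own for illustration; it assumes both procedures were scored on the same events, in the same order, and uses the normal critical value 1.96, which is reasonable for large n):

    import math

    def paired_squared_error_ttest(e1, e2):
        # One-sample t-test on the differences of squared errors.
        # e1, e2: per-event errors (prediction - outcome) for procedures 1 and 2,
        #         aligned so e1[i] and e2[i] refer to the same event.
        # Returns the t-statistic; |t| > 1.96 rejects equality of the MSEs
        # at the 5% level (normal approximation, large n).
        d = [a * a - b * b for a, b in zip(e2, e1)]   # step 2: e2^2 - e1^2
        n = len(d)
        mean = sum(d) / n                             # step 3
        sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))  # step 4
        return mean / (sd / math.sqrt(n))             # step 5: t-statistic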

The RMSE is a monotonic transformation of the MSE, so this test shouldn't give substantively different results. But be careful not to assume that MSE is the same as RMSE.

A bigger concern should be overfitting. Make sure to compute all your MSE statistics using data that you did not use to estimate your model.

Tristan answered Oct 20 '22


You are entering into a vast and contentious area of not only computation but philosophy. Significance tests and model selection are subjects of intense disagreement between the Bayesians and the Frequentists. Tristan's suggestion of splitting the dataset into training and verification sets would not please a Bayesian.

May I suggest that RMSE is not an appropriate score for probabilities? If the samples are independent, the proper score is the sum of the logarithms of the probabilities assigned to the actual outcomes. (If they are not independent, you have a mess on your hands.) What I am describing is scoring a "plug-in" model. Proper Bayesian modeling requires integrating over the model parameters, which is computationally extremely difficult. A Bayesian way to regularize a plug-in model is to add a penalty to the score for unlikely (large) model parameters. That's been called "weight decay."
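
For illustration, a minimal sketch of that log score (a plug-in score; the clipping constant eps is my own addition to keep log() finite when a prediction is exactly 0 or 1):

    import math

    def log_score(predictions, outcomes, eps=1e-12):
        # Sum of the logarithms of the probabilities assigned to the
        # actual outcomes; higher (less negative) is better.
        # predictions: predicted probabilities of the event (floats in [0, 1])
        # outcomes:    actual outcomes (0.0 or 1.0)
        # eps:         clipping constant so log() stays finite (an assumption)
        total = 0.0
        for p, y in zip(predictions, outcomes):
            p = min(max(p, eps), 1.0 - eps)
            total += math.log(p) if y == 1.0 else math.log(1.0 - p)
        return total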

I got started on my path of discovery reading Neural Networks for Pattern Recognition by Christopher Bishop. I used it and Practical Optimization by Gill et al. to write software that has worked very well for me.

Jive Dadson answered Oct 20 '22