Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implementing Wilson Score in SQL

We have a relatively small table that we would like to sort based on rating, using the Wilson interval or a reasonable equivalent. I'm a reasonably smart guy, but my math fu is nowhere near strong enough to understand this:

Wilson Score

The above formula, I am told, calculates a score for a positive/negative (thumbs up/thumbs down) voting system. I've never taken a statistics course, and it's been 15 years since I've done any sort of advanced mathematics. I don't have a clue what the little hat that the p is wearing means, or what the backwards Jesus fish beneath z indicates.

I would like to know two things:

  1. Can this formula be altered to accommodate a 5-star rating system? I found this, but the author expresses his doubts as to the accuracy of his formula.

  2. How can this formula be expressed in a SQL function? Note that I do not need to calculate and sort in real-time. The score can be calculated and cached daily.

  3. Am I overlooking something built-in to Microsoft SQL Server?

like image 701
dansays Avatar asked Dec 06 '09 15:12

dansays


3 Answers

Instead of trying to manipulate the Wilson's algorithm to do a 5 star rating system. Why don't you look into a different algorithm? This is what imdb uses for their top 250: Bayesian Estimate

As for explaining the math in the Wilson's algorithm, below was posted on the link in your first post. It is written in Ruby.

require 'statistics2'

def ci_lower_bound(pos, n, power)
    if n == 0
        return 0
    end
    z = Statistics2.pnormaldist(1-power/2)
    phat = 1.0*pos/n
    (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end

If you'd like another example, here is one in PHP: http://www.derivante.com/2009/09/01/php-content-rating-confidence/

Edit: It seems that derivante.com is no longer around. You can see the original article on archive.org - https://web.archive.org/web/20121018032822/http://derivante.com/2009/09/01/php-content-rating-confidence/ and I've added the code from the article below.

class Rating
{
  public static function ratingAverage($positive, $total, $power = '0.05')
  {
    if ($total == 0)
      return 0;
    $z = Rating::pnormaldist(1-$power/2,0,1);
    $p = 1.0 * $positive / $total;
    $s = ($p + $z*$z/(2*$total) - $z * sqrt(($p*(1-$p)+$z*$z/(4*$total))/$total))/(1+$z*$z/$total);
    return $s;
  }
  public static function pnormaldist($qn)
  {
    $b = array(
      1.570796288, 0.03706987906, -0.8364353589e-3,
      -0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5,
      -0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8,
      0.3657763036e-10, 0.6936233982e-12);
    if ($qn < 0.0 || 1.0 < $qn)
      return 0.0;
    if ($qn == 0.5)
      return 0.0;
    $w1 = $qn;
    if ($qn > 0.5)
      $w1 = 1.0 - $w1;
    $w3 = - log(4.0 * $w1 * (1.0 - $w1));
    $w1 = $b[0];
    for ($i = 1;$i <= 10; $i++)
      $w1 += $b[$i] * pow($w3,$i);
    if ($qn > 0.5)
      return sqrt($w1 * $w3);
    return - sqrt($w1 * $w3);
  }
}

As for doing this in SQL, SQL has all these Math functions already in it's library. If I were you I'd do this in your application though. Make your application update your database every so often (hours? days?) instead of doing this on the fly or your application will become very slow.

like image 124
William Avatar answered Oct 17 '22 08:10

William


Regarding your first question (adjusting the formula to the 5-stars system) I would agree with Paul Creasey.

conversion formula: [3 +/- i stars -> i up/down-votes] (3 stars -> 0)

example: 4 stars -> +1 up-vote, 5 stars -> +2, 1 -> -2 and so on.

I would note though that instead of the lower bound of the interval that both ruby and php functions compute, I would just compute the much more simple wilson midpoint:

(x + (z^2)/2) / (n + z^2)

where:
n = Sum(up_votes) + Sum(|down_votes|)
x = (positive votes)/n = Sum(up_votes) / n
z = 1.96 (fixed value)

like image 26
Galois Avatar answered Oct 17 '22 09:10

Galois


Taking Williams link to the php solution http://www.derivante.com/2009/09/01/php-content-rating-confidence/ and making your system such that it just postive and negative (5 stars could be 2 pos, 1 start could be 2 neg perhaps) then it would be fairly easy to convert it to T-SQL, but you'd be much better off doing it in the server side logic.

like image 2
Paul Creasey Avatar answered Oct 17 '22 08:10

Paul Creasey