I have a website where users can "Like" and "Dislike" items.
So for each item, I have data such as the total number of "Likes" and the % of total votes that are "Likes".
I'd like to calculate a single score to show to users. Using just the % wouldn't work: even though item_A might have 90% "Likes" while item_B has only 80% "Likes", item_B should still rank ahead of item_A if item_B has 10,000 total votes while item_A only has 1,000 total votes.
Likewise using just total "Likes" wouldn't work because while an item might have a large number of "Likes" it shouldn't be ranked very high if the % of "Likes" is low.
What would be a good algorithm to create a single score out of the data above?
Ideally the score should be "meaningful" or "normalized" in some way. For example if I go to IMDB and I see that a movie has a score of 8/10, I'd immediately know that it is a good movie. On the other hand if I see a score of 1,370 I wouldn't necessarily know if that is good or bad.
Bayesian rating is a good fit for what you want to do. It takes care of the "fewer votes but higher rating" issue.
Bayesian rating is based on the Bayesian average, which weights an item's rating by the “believability” of its votes. The more votes an item has, the closer its Bayesian rating gets to its plain, unweighted rating. When an item has very few votes, its Bayesian rating stays close to the average rating of all items.
Use this equation:
br = ( (avg_num_votes * avg_rating) + (this_num_votes * this_rating) ) / (avg_num_votes + this_num_votes)
Legend:
avg_num_votes: the average number of votes per item, taken over all items that have num_votes > 0
avg_rating: the average rating across all items (again, only those that have num_votes > 0)
this_num_votes: number of votes for this item
this_rating: the rating of this item
Note: avg_num_votes is used as the “magic” weight in this formula. The higher this value, the more votes it takes to move an item's Bayesian rating away from avg_rating.
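To make the formula concrete, here is a minimal Python sketch of it. The `Item` class and the `bayesian_ratings` function are hypothetical names made up for this example, and it assumes that an item's rating is simply its fraction of "Likes" (a value between 0 and 1), which fits the Like/Dislike setup in the question.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical container for the per-item data described in the question.
    name: str
    likes: int
    dislikes: int

    @property
    def num_votes(self) -> int:
        return self.likes + self.dislikes

    @property
    def rating(self) -> float:
        # Assumption: the rating is the fraction of votes that are likes.
        return self.likes / self.num_votes if self.num_votes else 0.0


def bayesian_ratings(items):
    """Return {item name: Bayesian rating} using the formula above."""
    # Both averages only consider items that have at least one vote.
    voted = [it for it in items if it.num_votes > 0]
    if not voted:
        return {it.name: 0.0 for it in items}

    avg_num_votes = sum(it.num_votes for it in voted) / len(voted)
    avg_rating = sum(it.rating for it in voted) / len(voted)

    return {
        it.name: (avg_num_votes * avg_rating + it.num_votes * it.rating)
        / (avg_num_votes + it.num_votes)
        for it in items
    }


items = [
    Item("A", likes=2, dislikes=0),      # 100% likes, but only 2 votes
    Item("B", likes=900, dislikes=100),  # 90% likes, 1,000 votes
    Item("C", likes=300, dislikes=300),  # 50% likes, 600 votes
]
scores = bayesian_ratings(items)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name] * 10, 1))  # scaled to a 0-10 score
```

In this sample data, item B ends up ahead of item A even though A has the higher raw percentage, because A's two votes barely move it away from the overall average. Multiplying by 10 is just one way to get the IMDB-style scale the question asks for.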
You can read more here
There are a couple of very good articles on how Reddit does this sort of ranking here, and here. In a nutshell: rank posts by the lower end of the 90% confidence interval of their scores. Entries with fewer votes have wider confidence intervals, and hence tend to rank lower than entries with more votes but the same average.