Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Where do mathematical algorithms for Reddit's ranking, as an example, come from?

recently I was looking at Reddit's algorithm for determining what makes a post a "hot" topic and which content is suitable for the reddit homepage.

the article I was reading is here: http://amix.dk/blog/post/19588

I've noticed they have mathematical logorithms and have created some kind of a mathematical function to determine the hotness/relevance of a post.

In the formulas used, where do each of the mathematical components come from and how do they know to use them?

thank you!

-- Bakz

EDIT: just to clarify, I just graduated high school and apologize if the answer to this question seems pretty obvious. thanks again!

like image 876
bakz Avatar asked Jul 03 '11 22:07

bakz


People also ask

What is Reddit's algorithm?

Reddit uses a confidence sort algorithm, a way for top-rated comments on posts with the most data to rank the highest. This means that no matter which type of vote comment a post receives, it is considered good for the post.

What is a ranking algorithm?

A ranking algorithm is a procedure that ranks items in a dataset according to some criterion. Ranking algorithms are used in many different applications, such as web search, recommender systems, and machine learning.

Is Reddit algorithm open source?

Reddit is open sourced and the code is freely available. Reddit is implemented in Python and their code is located here. Their sorting algorithms are implemented in Pyrex, which is a language to write Python C extensions. They have used Pyrex for speed reasons.

How does Reddit rank hot posts?

The hot ranking algorithm ranks new stories higher than older stories. The logarithm scale: The hot ranking algorithm uses the logarithm function to weigh the earlier votes higher than later ones. This means that generally, the first ten upvotes that a post receives will have the same weight as the next 100 upvotes.


1 Answers

I'll tackle the first formula, for "hotness" of posts. Formulas like this come from requirements. The designers of Reddit have thought about what they want to achieve, and designed formulas accordingly. I can't tell you exactly what requirements they had in mind, but I can look at the implementation and guess that they wanted a system along these lines:

  1. Scores shouldn't need to be recomputed unless the number of votes change. This reduces the number of changes to the database, and makes it easier to achieve consistency if data is replicated. (So any scoring system based on scores getting lower as the article ages will be no good).

  2. If two stories are equally old, the one with more upvotes should be higher. (So there needs to be a contribution from the votes.)

  3. The more upvotes a story gets, the longer it should remain near the top of the ranking.

  4. Old stories shouldn't stay at the top of the rankings for ever, even if they had lots of upvotes. Fairly soon (after a day or two), new stories need to outrank them. (So there needs to be a contribution from the date, and this must outweigh the score due to votes fairly soon, no matter how many votes something gets.)

  5. Stories with more downvotes than upvotes should not appear in the rankings at all.

Now let's look at the formula: log z + yt / 45000 and see how it satisfies these requirements.

  1. If the number of votes does not change, then z, y and t are all unchanged. So the score is unchanged. This satisfies requirement (1).

  2. If two stories have the same age, then they have the same value for t. But the one with more upvotes has a higher value of z, and since log is monotonic, it has a higher score. This satisfies requirement (2).

  3. The more upvotes a story has, the higher its z, so the longer it will be until another story with higher t can outrank it. This satisfies requirement (3).

  4. Logarithm is a function that grows more slowly as it gets larger (take a look at its graph). So a story needs more and more upvotes over time to keep up with newer stories. This satisfies requirement (4).

  5. If the story has more downvotes than upvotes, then z = 1 and y = −1 so the score is negative. This satisfies requirement (5).

The constant 45,000 is a scale factor that brings the upvotes and the age into balance. There are 86,400 seconds in a day, so t gets larger by this amount each day. Dividing t by 45,000 gives 1.92 which means that one day's relative newness is worth is 101.92 = 83 votes, and two days' relative newness are worth roughly 7,000 votes.

like image 111
Gareth Rees Avatar answered Sep 21 '22 11:09

Gareth Rees