Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How are Reddit and Hacker News ranking algorithms used?

Tags:

php

mysql

ranking

I've been looking at ranking algorithms recently, specifically those used by Reddit and Hacker News. The algorithms themselves are simple enough, but I don't quite understand how they are used.

One thing that I could do is implement the algorithm straight in SQL, so that every time a user goes to a page displaying ranked posts, something like this would run:

SELECT thing1, thing2 FROM table
ORDER BY ranking_algorithm DESC
LIMIT page*20, 20

There are several similar questions on SO, but the only answer given is to put the ranking algorithm inside the SQL query. Then the thread dies...

Putting the algorithm in the SQL query fine on a smaller scale, but what if the website has a large number of users and a very large number of posts? That means that the every time any user opens a page that displays ranked posts, that query will be run. That can't be very efficient.

Now, Reddit and Hacker News don't run their ranking algorithms in as SQL queries, but in python and ark respectively. So how and when exactly are they used?

One possible solution is to take all the relevant info from every post and store it in some data structure on the web server. Then rank and sort this data structure.

Every time someone opens a page that shows the ranked posts, you just go to the data structure, retrieve the correct range of posts, and display them.

Then every half hour or so, you retrieve the most up to date information from the server, rank it, sort it, and update the data structure.

Other less expensive queries, such as retrieving and displaying all the info from a specific post, or displaying the newest posts (as opposed to the best scored) could be done in SQL every time the relevant page is opened.

The advantage is that your database is being hit (for the expensive ranking query) only once every half hour. The disadvantage is that you need to have a duplicate of a large chunk of your database.

like image 909
Morglor Avatar asked Mar 10 '11 15:03

Morglor


People also ask

How does Reddit ranking algorithm work?

Reddit uses a story algorithm, meaning the number of votes and submission time of links have the biggest impact on how stories rank on the platform. Reddit also ranks items by the number of votes they accumulate, as well as the age of the post compared to others.

How does Reddit rank hot posts?

Reddit's hot ranking uses the logarithm function to weight the first votes higher than the rest. Generally this applies: The first 10 upvotes have the same weight as the next 100 upvotes which have the same weight as the next 1000 etc…

What is a ranking algorithm?

A ranking algorithm is a procedure that ranks items in a dataset according to some criterion. Ranking algorithms are used in many different applications, such as web search, recommender systems, and machine learning.


2 Answers

I implemented an SQL version of Reddit's ranking algorithm for a video aggregator like so:

SELECT id, title
FROM videos
ORDER BY 
    LOG10(ABS(cached_votes_total) + 1) * SIGN(cached_votes_total)   
    + (UNIX_TIMESTAMP(created_at) / 300000) DESC
LIMIT 50

cached_votes_total is updated by a trigger whenever a new vote is cast. It runs fast enough on our current site, but I am planning on adding a ranking value column and updating it with the same trigger as the cached_votes_total column. After that optimization, it should be fast enough for most any size site.

edit: More information at Reddit Hotness Algorithm in SQL

like image 67
Chris Broski Avatar answered Sep 18 '22 19:09

Chris Broski


Reddit uses Pyrex, the sort algorithm is a Python C extension to improve performance.

So, you can do the same in SQL when the record is updated, pex: when is up or down voted.

The pseudocode you must to translate to your SQL engine syntax:

function hot(ups, downs, date){
    score = ups - downs;
    order = log(max(abs(score), 1), 10);
    if (score>0){
        sign = 1;
    } else {
        if (score<0){
            sign = -1;
        } else {
            sign = 0;
        }
    }
    td = date - datetime(1970,1,1);
    seconds = td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000) - 1134028003;

    return round(order + sign * seconds / 45000, 7);
}

So you must to store in the post table the ups, downs, date and the hot function result. And then you can make a sort in the hot column.

You can see the Reddit source code here: http://code.reddit.com/

like image 31
Andrés Morales Avatar answered Sep 16 '22 19:09

Andrés Morales