Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I incorporate the Reddit page/post ranking algorithm?

I'm trying to learn how to code a website algorithm like Reddit.com where there are thousands of posts that need to be ranked. Their ranking algorithm works like this (you don't have to read it, its more of a general question that I have): http://amix.dk/blog/post/19588

Right now I have posts stored in a database, I record their dates and they each have an upvotes and downvotes field so I'm storing their records. I want to figure out how do you store their rankings? When specific posts have ranking values, but they change with time, how could you store their rankings?

If they aren't stored, do you rank every post every time a user loads the page?

When would you store the posts? Do you run a cron job to automatically give every post a new value every x minutes? Do you store their value? Which is temporary. Maybe, until that post reaches its minimum score and is forgotten?

like image 389
Thingamajig Avatar asked Oct 01 '12 03:10

Thingamajig


3 Answers

I would definitely not calculate their rank every time you display them.

A simple, and not so performant solution would be to cache post rankings, and once one post's ranking changes, you clear or refresh the cache.

That is not ideal, but it is possible.

Another way would be to do as you alluded to: calculate and store ranks in the database (and ideally cache them), and then refresh those rankings using a cron job every x minutes.

Again, these are basic approaches to what you want to do. You can then build on them over time.

The algorithm you choose will most likely be very particular to your needs.

You need to also gauge what kind of traffic your site would be getting, as it would dictate what kind of lengths you should go through to get the right algorithm.

like image 143
Kovo Avatar answered Oct 20 '22 08:10

Kovo


I would instantly calculate a score for the single vote on a time-weighted scale. I would send that score into a queue or use it to increment a field depending whichever of those is performant for you.

At a regular time interval, I would take all currently ranked articles and all articles that have received votes during the time window and rescore all ranked articles followed by all the queued articles in descending order of score until I had calculated enough to fill my ranking quota.

The ranking list would be cached and used until the next ranking cycle. You'll have to adjust the queue retention period (maybe anything that had activity in the last N queues is re-queued), retention of articles, etc. based on your site load, but this should be a well-performing starting point.

like image 24
Jeff Ferland Avatar answered Oct 20 '22 09:10

Jeff Ferland


If you're using the exact algorithm reddit uses, you only need to change the ranking field whenever an item is voted up or down - and really only when the difference between upvotes and downvotes changes orders of magnitude. This article explains a little more about how their ranking works.

http://bibwild.wordpress.com/2012/05/08/reddit-story-ranking-algorithm/

Basically, the up and down votes only serve to "displace" the posts. If D is the difference between the number of upvotes and downvotes, a post is shifted up or down 12 hours per order of magnitude of D. Other than that, it's just a simple time ranking.


if however you want to use your own ranking system where age of the post matters in some way other than linearly, you'll have to either create an indexed field and recalculate the rankings at time intervals as has been said, or just put your sorting into your SQL query, as I've said in my comment. But chances are, you can find a way where it doesn't have to be recalculated over and over.

like image 1
Nacht Avatar answered Oct 20 '22 09:10

Nacht