Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do you scale a voting system on a high traffic website?

View

You land on a page full of comments. The voting system of the comments is highlighted with your votes.

vote up reddit

Database

To support this requirement, the database schema would be the following, at a minimum:

Page

  • int pageId

Comments

  • int commentId
  • int pageId

Votes

  • int userId
  • int commentId
  • enum(up, down) direction

Controller

If the page ID were 123 and the user ID were 456, this would be the naive implementation of the controller:

1) Query for all the votes made by user 456 on the comments of page 123:

SELECT c.commentId, v.direction 
FROM comments AS c, votes AS v 
WHERE c.pageId = 123 
AND c.commentId = v.commentId
AND v.userId = 456

2) Construct the view with the results of this query.

Scalability Problem

The database query to support this voting system is very expensive. The comments and votes table will be huge. On a high traffic site, thousands of users will be executing this query every second to receive a personalized view of the comment voting. How do you scale this voting system so the database does not become overloaded with too many requests? Would you cache it in memory? Isn't it bet practice to only cache things common to a large audience? In this case, these queries are specific to single users. Memory would quickly be filled up on a website with millions of users. Cache misses would occur and the database would be thrashed.

like image 758
JoJo Avatar asked Nov 05 '22 16:11

JoJo


1 Answers

I think Reddit will cache/store (for each comment) the list of users that up voted it (and another for down votes) and only update this cache every X seconds/minutes/hours depending on activity. The list will be organised such that a binary search is possible.

Then when generating the page the server only needs to say check whether the current user id is "in the list" of up/down voters for each comment. Reddit also limits the number of comments initially visible, which will reduce the number of tests required.

Reddit also doesn't update votes immediately (they add the vote to a queue). They can tie together the processing of the queue and the caching of the votes.

I assume reddit must also keep track of recent votes on a per user basis so that they can fill in the gaps between the last cache update and now.

This might not be 100% accurate.
It is based on limited reading about Reddit architecture and what I would do.

like image 188
diolemo Avatar answered Nov 15 '22 04:11

diolemo