Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implementing a popularity algorithm in Django

I am creating a site similar to reddit and hacker news that has a database of links and votes. I am implementing hacker news' popularity algorithm and things are going pretty swimmingly until it comes to actually gathering up these links and displaying them. The algorithm is simple:

Y Combinator's Hacker News:
Popularity = (p - 1) / (t + 2)^1.5`

Votes divided by age factor.
Where`

p : votes (points) from users.
t : time since submission in hours.

p is subtracted by 1 to negate submitter's vote.
Age factor is (time since submission in hours plus two) to the power of 1.5.factor is (time since submission in hours plus two) to the power of 1.5.

I asked a very similar question over yonder Complex ordering in Django but instead of contemplating my options I choose one and tried to make it work because that's how I did it with PHP/MySQL but I now know Django does things a lot differently.

My models look something (exactly) like this

class Link(models.Model):
category = models.ForeignKey(Category)
user = models.ForeignKey(User)
created = models.DateTimeField(auto_now_add = True)
modified = models.DateTimeField(auto_now = True)
fame = models.PositiveIntegerField(default = 1)
title = models.CharField(max_length = 256)
url = models.URLField(max_length = 2048)

def __unicode__(self):
    return self.title

class Vote(models.Model):
link = models.ForeignKey(Link)
user = models.ForeignKey(User)
created = models.DateTimeField(auto_now_add = True)
modified = models.DateTimeField(auto_now = True)
karma_delta = models.SmallIntegerField()

def __unicode__(self):
    return str(self.karma_delta)

and my view:

def index(request):
popular_links = Link.objects.select_related().annotate(karma_total = Sum('vote__karma_delta'))
return render_to_response('links/index.html', {'links': popular_links})

Now from my previous question, I am trying to implement the algorithm using the sorting function. An answer from that question seems to think I should put the algorithm in the select and sort then. I am going to paginate these results so I don't think I can do the sorting in python without grabbing everything. Any suggestions on how I could efficiently do this?

EDIT

This isn't working yet but I think it's a step in the right direction:

from django.shortcuts import render_to_response
from linkett.apps.links.models import *

def index(request):
popular_links = Link.objects.select_related()
popular_links = popular_links.extra(
    select = {
        'karma_total': 'SUM(vote.karma_delta)',
        'popularity': '(karma_total - 1) / POW(2, 1.5)',
    },
    order_by = ['-popularity']
)
return render_to_response('links/index.html', {'links': popular_links})

This errors out into:

Caught an exception while rendering: column "karma_total" does not exist
LINE 1: SELECT ((karma_total - 1) / POW(2, 1.5)) AS "popularity", (S...

EDIT 2

Better error?

TemplateSyntaxError: Caught an exception while rendering: missing FROM-clause entry for table "vote"
LINE 1: SELECT ((vote.karma_total - 1) / POW(2, 1.5)) AS "popularity...

My index.html is simply:

{% block content %}

{% for link in links %}
 
  
   karma-up
   {{ link.karma_total }}
   karma-down
  
  {{ link.title }}
  Posted by {{ link.user }} to {{ link.category }} at {{ link.created }}
 
{% empty %}
 No Links
{% endfor %}

{% endblock content %}

EDIT 3 So very close! Again, all these answers are great but I am concentrating on a particular one because I feel it works best for my situation.

from django.db.models import Sum
from django.shortcuts import render_to_response
from linkett.apps.links.models import *

def index(request): popular_links = Link.objects.select_related().extra( select = { 'popularity': '(SUM(links_vote.karma_delta) - 1) / POW(2, 1.5)', }, tables = ['links_link', 'links_vote'], order_by = ['-popularity'], ) return render_to_response('links/test.html', {'links': popular_links})

Running this I am presented with an error hating on my lack of group by values. Specifically:

TemplateSyntaxError at /
Caught an exception while rendering: column "links_link.id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...karma_delta) - 1) / POW(2, 1.5)) AS "popularity", "links_lin...

Not sure why my links_link.id wouldn't be in my group by but I am not sure how to alter my group by, django usually does that.

like image 744
TheLizardKing Avatar asked Dec 27 '09 06:12

TheLizardKing


3 Answers

On Hacker News, only the 210 newest stories and 210 most popular stories are paginated (7 pages worth * 30 stories each). My guess is that the reason for the limit (at least in part) is this problem.

Why not drop all the fancy SQL for the most popular stories and just keep a running list instead? Once you've established a list of the top 210 stories you only need to worry about reordering when a new vote comes in since relative order is maintained over time. And when a new vote does come in, you only need to worry about reordering the story that received the vote.

If the story that received the vote is not on the list, calculate the score of that story, plus the least popular story that is on the list. If the story that received the vote is lower, you're done. If it's higher, calculate the current score for the second-to-least most popular (story 209) and compare again. Continue working up until you find a story with a higher score and then place the newly-voted-upon story right below that one in the rankings. Unless, of course, it reaches #1.

The benefit of this approach is that it limits the set of stories you have to look at to figure out the top stories list. In the absolute worst case scenario, you have to calculate the ranking for 211 stories. So it's very efficient unless you have to establish the list from an existing data set - but that's just a one-time penalty assuming you cache the list someplace.

Downvotes are another issue, but I can only upvote (at my karma level, anyway).

like image 143
John Debs Avatar answered Nov 14 '22 04:11

John Debs


popular_links = Link.objects.select_related()
popular_links = popular_links.extra(
    select = {
        'karma_total': 'SUM(vote.karma_delta)',
        'popularity': '(karma_total - 1) / POW(2, 1.5)'
    },
    order_by = ['-popularity']
)

Or select some sane number, sort the selection using python in any way you like, and cache if its going to be static for all users which it looks like it will - set cache expiration to a minute or so.

But the extra will work better for paginated results in a highly dynamic setup.

like image 4
kibitzer Avatar answered Nov 14 '22 06:11

kibitzer


Seems like you could overload the save of the Vote class and have it update the corresponding Link object. Something like this should work well:

from datetime import datetime, timedelta

class Link(models.Model):
 category = models.ForeignKey(Category)
 user = models.ForeignKey(User)
 created = models.DateTimeField(auto_now_add = True)
 modified = models.DateTimeField(auto_now = True)
 fame = models.PositiveIntegerField(default = 1)
 title = models.CharField(max_length = 256)
 url = models.URLField(max_length = 2048)

 #a field to keep the most recently calculated popularity
 popularity = models.FloatField(default = None)

 def CalculatePopularity(self):
  """
  Add a shorcut to make life easier ... this is used by the overloaded save() method and 
  can be used in a management function to do a mass-update periodically
  """
  ts = datetime.now()-self.created
  th = ts.seconds/60/60
  self.popularity = (self.user_set.count()-1)/((th+2)**1.5)

 def save(self, *args, **kwargs):
  """
  Modify the save function to calculate the popularity
  """
  self.CalculatePopularity()
  super(Link, self).save(*args, **kwargs)


 def __unicode__(self):
     return self.title

class Vote(models.Model):
 link = models.ForeignKey(Link)
 user = models.ForeignKey(User)
 created = models.DateTimeField(auto_now_add = True)
 modified = models.DateTimeField(auto_now = True)
 karma_delta = models.SmallIntegerField()

 def save(self, *args, **kwargs):
  """
  Modify the save function to calculate the popularity of the Link object
  """
  self.link.CalculatePopularity()
  super(Vote, self).save(*args, **kwargs)

 def __unicode__(self):
     return str(self.karma_delta)

This way every time you call a link_o.save() or vote_o.save() it will re-calculate the popularity. You have to be a little careful because when you call Link.objects.all().update('updating something') then it won't call our overloaded save() function. So when I use this sort of thing I create a management command which updates all of the objects so they're not too out of date. Something like this will work wonderfully:

from itertools import imap
imap(lambda x:x.CalculatePopularity(), Link.objects.all().select_related().iterator())

This way it will only load a single Link object into memory at once ... so if you have a giant database it won't cause a memory error.

Now to do your ranking all you have to do is:

Link.objects.all().order_by('-popularity')

It will be super-fast since all of you Link items have already calculated the popularity.

like image 4
JudoWill Avatar answered Nov 14 '22 05:11

JudoWill