 

Algorithm to determine most popular article last week, month and year?

I'm working on a project where I need to sort a list of user-submitted articles by their popularity (last week, last month and last year).

I've been mulling this over for a while, but I'm not a great statistician, so I figured I could maybe get some input here.

Here are the variables available:

  • Time [date] the article was originally published
  • Time [date] the article was recommended by editors (if it has been)
  • Number of votes the article has received from users (total, in the last week, in the last month, in the last year)
  • Number of times the article has been viewed (total, in the last week, in the last month, in the last year)
  • Number of times the article has been downloaded by users (total, in the last week, in the last month, in the last year)
  • Comments on the article (total, in the last week, in the last month, in the last year)
  • Number of times a user has saved the article to their reading-list (total, in the last week, in the last month, in the last year)
  • Number of times the article has been featured on a kind of "best we've got to offer" (editorial) list (total, in the last week, in the last month, in the last year)
  • Time [date] the article was dubbed 'article of the week' (if it has been)

Right now I'm weighting each variable and dividing by the number of times the article has been read. That's pretty much all I could come up with after reading up on Weighted Means. My biggest problem is that some user-articles are always at the top of the popular-list, probably because the author is "cheating".

I'm thinking of emphasizing the importance of the article being relatively new, but I don't want to "punish" articles that are genuinely popular just because they're a bit old.

Anyone with a more statistically adept mind than mine willing to help me out?

Thanks!

AmITheRWord asked Oct 14 '10 15:10



1 Answer

I think the weighted means approach is a good one. But I think there are two things you need to work out.

  1. How to weigh the criteria.
  2. How to prevent "gaming" of the system.

How to weigh the criteria

This question falls under the domain of Multi-Criteria Decision Analysis. Your approach is the Weighted Sum Model. In any computational decision-making process, ranking the criteria is often the most difficult part. I suggest you take the route of pairwise comparisons: how important do you think each criterion is compared to the others? Build yourself a table like this:

      c1     c2     c3    ...
c1    1      4      2
c2    1/4    1      1/2
c3    1/2    2      1
...

This shows that C1 is 4 times as important as C2, which is half as important as C3. Use a finite pool of weightings, say 1.0 since that's easy. Distributing it over the criteria, we have C1 + C2 + C3 = 1 with C1 = 4 * C2 and C3 = 2 * C2, so 7 * C2 = 1 and C1 = 4/7, C3 = 2/7, C2 = 1/7. Where discrepancies arise (for instance if you think C1 = 2*C2 = 3*C3, but C3 = 2*C2), that's a good error indication: it means that you're inconsistent with your relative rankings, so go back and reexamine them. I forget the name of this procedure; comments would be helpful here. This is all well documented.
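As a rough sketch of that bookkeeping in Python: the snippet below takes the three-criterion table above, normalises one column into weights that sum to 1, and lists any triples where the pairwise ratios contradict each other. The names `CRITERIA`, `PAIRWISE`, `weights_from_column` and `inconsistencies` are made up for this example. Pairwise tables of this kind are also what the Analytic Hierarchy Process starts from, which may be the name being reached for, though that is a guess.

```python
from fractions import Fraction

# Pairwise table from above: entry [i][j] says how many times more
# important criterion i is than criterion j.
CRITERIA = ["c1", "c2", "c3"]
PAIRWISE = [
    [Fraction(1), Fraction(4), Fraction(2)],
    [Fraction(1, 4), Fraction(1), Fraction(1, 2)],
    [Fraction(1, 2), Fraction(2), Fraction(1)],
]


def weights_from_column(matrix, column=0):
    """Normalise one column of the table so the weights sum to 1."""
    col = [row[column] for row in matrix]
    total = sum(col)
    return [value / total for value in col]


def inconsistencies(matrix):
    """Return (i, j, k) triples where m[i][j] * m[j][k] != m[i][k]."""
    n = len(matrix)
    return [
        (i, j, k)
        for i in range(n)
        for j in range(n)
        for k in range(n)
        if matrix[i][j] * matrix[j][k] != matrix[i][k]
    ]


weights = dict(zip(CRITERIA, weights_from_column(PAIRWISE)))
print(weights)                    # c1 = 4/7, c2 = 1/7, c3 = 2/7
print(inconsistencies(PAIRWISE))  # empty list: the table is consistent
```

When the table is consistent, every column gives the same weights after normalising, so the choice of column does not matter. When `inconsistencies` returns anything, that is the cue to go back and revisit the comparisons.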

Now, this all probably seems a bit arbitrary to you at this point. They're for the most part numbers you pulled out of your own head. So I'd suggest taking a sample of maybe 30 articles and ranking them in the way "your gut" says they should be ordered (often you're more intuitive than you can express in numbers). Finagle the numbers until they produce something close to that ordering.
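One way to make "close to that ordering" measurable, as a minimal sketch: score each sample article with the weighted sum, sort by score, and count how many pairs of articles end up in the opposite order from the gut ranking. Everything here is invented for illustration: the criterion names, the counts, the weights, and the helper names `score`, `model_order` and `disagreements`.

```python
from itertools import combinations

# Hypothetical weights and per-article counts, for illustration only.
WEIGHTS = {"votes": 4 / 7, "comments": 1 / 7, "saves": 2 / 7}

ARTICLES = {
    "a": {"votes": 10, "comments": 2, "saves": 1},
    "b": {"votes": 3, "comments": 9, "saves": 0},
    "c": {"votes": 6, "comments": 1, "saves": 8},
}

GUT_ORDER = ["c", "a", "b"]  # hand-made "gut" ranking, best first


def score(stats, weights):
    """Weighted sum of one article's criteria."""
    return sum(weights[name] * stats[name] for name in weights)


def model_order(articles, weights):
    """Article keys sorted by weighted score, best first."""
    return sorted(articles, key=lambda key: score(articles[key], weights), reverse=True)


def disagreements(order_a, order_b):
    """Count pairs that the two orderings rank in opposite order."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])


ranked = model_order(ARTICLES, WEIGHTS)
print(ranked)                            # ['a', 'c', 'b']
print(disagreements(ranked, GUT_ORDER))  # 1
```

A count of 0 means the weights reproduce the gut ranking exactly; adjusting the weights and re-running shows whether a change moves the count up or down.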

Preventing gaming

This is the second important aspect. No matter what system you use, if you can't prevent "cheating" it will ultimately fail. You need to be able to limit voting (should an IP be able to recommend a story twice?). You need to be able to prevent spam comments. The more important the criterion, the more you need to prevent it from being gamed.
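For the voting limit, a minimal sketch of counting at most one vote per IP per article. The `(article_id, ip)` record shape and the function name `count_unique_votes` are assumptions made for the example, not anything from your system.

```python
def count_unique_votes(votes):
    """Count at most one vote per (article_id, ip) pair.

    `votes` is an iterable of (article_id, ip) tuples; this record shape
    is an assumption made for the example.
    """
    seen = set()
    counts = {}
    for article_id, ip in votes:
        if (article_id, ip) in seen:
            continue  # repeat vote from the same IP: ignore it
        seen.add((article_id, ip))
        counts[article_id] = counts.get(article_id, 0) + 1
    return counts


raw_votes = [
    (1, "203.0.113.5"),
    (1, "203.0.113.5"),   # duplicate, not counted
    (1, "198.51.100.7"),
    (2, "203.0.113.5"),
]
print(count_unique_votes(raw_votes))  # {1: 2, 2: 1}
```

Keying on IP alone is crude, since several readers can share one address, so keying on a logged-in user id instead may be the better choice if you have one.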

Mark Peters answered Nov 03 '22 01:11
