
Algorithm to calculate a page importance based on its views / comments

I need an algorithm that allows me to determine an appropriate <priority> field for my website's sitemap based on the page's views and comments count.

For those of you unfamiliar with sitemaps, the priority field is used to signal the importance of a page relative to the others on the same website. It must be a decimal number between 0 and 1.

The algorithm will accept two parameters, viewCount and commentCount, and will return the priority value. For example:

GetPriority(100000, 100000); // Damn, a lot of views/comments! The returned value will be very close to 1, for example 0.995
GetPriority(3, 2); // Ok not many users are interested in this page, so for example it will return 0.082
stacker asked May 27 '10 02:05


4 Answers

You mentioned doing this in an SQL query, so I'll give samples in that.

If you have a table/view Pages, something like this

Pages
-----
page_id:int
views:int  - indexed
comments:int - indexed

Then you can order them by writing

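-- Weighted, log-scaled score: 30% views, 70% comments.
-- DESC puts the highest-scoring pages first.
SELECT * FROM Pages
ORDER BY
    0.3*LOG10(10+views)/LOG10(10+(SELECT MAX(views) FROM Pages)) +
    0.7*LOG10(10+comments)/LOG10(10+(SELECT MAX(comments) FROM Pages)) DESC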

I've deliberately chosen unequal weighting between views and comments. A problem that can arise with keeping an equal weighting with views/comments is that the ranking becomes a self-fulfilling prophecy - a page is returned at the top of the list, so it's visited more often, and thus gets more points, so it's shown at the top of the list, and it's visited more often, and it gets more points.... Putting more weight on the comments reflects that these take real effort and show real interest.
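
For readers who would rather compute the score in application code, here is a minimal Python sketch of the same weighted log formula. The function name log_score and its parameter names are made up for illustration; the 0.3/0.7 weights and the +10 offset come from the query above.

```python
import math

def log_score(views, comments, max_views, max_comments,
              w_views=0.3, w_comments=0.7):
    """Weighted log-scaled score; a page holding both site maxima scores 1.0."""
    # The +10 offset keeps both logarithms positive, even for zero counts.
    view_part = math.log10(10 + views) / math.log10(10 + max_views)
    comment_part = math.log10(10 + comments) / math.log10(10 + max_comments)
    return w_views * view_part + w_comments * comment_part

# Site maxima are assumed to be 100000 views and 100000 comments here.
busy = log_score(100000, 100000, 100000, 100000)
quiet = log_score(3, 2, 100000, 100000)
print(busy, quiet)
```

Because of the logarithm, the quiet page still lands well above zero; only its position relative to other pages matters.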

The above formula will give you ranking based on all-time statistics. So an article that amassed the same number of views/comments in the last week as another article amassed in the last year will be given the same priority. It may make sense to repeat the formula, each time specifying a range of dates, and favoring pages with higher activity, e.g.

  0.3*(score for views/comments today) - live data
  0.3*(score for views/comments in the last week)
  0.25*(score for views/comments in the last month)
  0.15*(score for all views/comments, all time)

This will ensure that "hot" pages are given higher priority than similarly scored pages that haven't seen much action lately. All values apart from today's scores can be persisted in tables by scheduled stored procedures, so that the database doesn't have to aggregate large volumes of comment/view stats on every query. Only today's stats are computed "live". Taking it one step further, the ranking formula itself can be computed and stored for historical data by a stored procedure run daily.
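
The time-window blend could look roughly like this in Python. It is only a sketch: the window names and the blended_score function are hypothetical, and each per-window score is assumed to have been computed already with the ranking formula over that date range.

```python
# Weights from the list above; they sum to 1.0.
WINDOW_WEIGHTS = {"today": 0.30, "week": 0.30, "month": 0.25, "all_time": 0.15}

def blended_score(window_scores):
    """Combine per-window scores (each in 0..1) into a single score."""
    return sum(WINDOW_WEIGHTS[name] * window_scores[name]
               for name in WINDOW_WEIGHTS)

# Two made-up pages: one active right now, one popular only historically.
hot = {"today": 0.9, "week": 0.8, "month": 0.5, "all_time": 0.4}
stale = {"today": 0.1, "week": 0.2, "month": 0.5, "all_time": 0.9}
print(blended_score(hot), blended_score(stale))
```

With these sample numbers the recently active page comes out ahead, even though the other page has the stronger all-time score.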

EDIT: To get a strict range from 0.1 to 1.0, you would modify the formula like this. But I stress - this will only add overhead and is unnecessary - the absolute values of priority are not important - only their relative values to other urls. The search engine uses these to answer the question, is URL A more important/relevant than URL B? It does this by comparing their priorities - which one is greatest - not their absolute values.

// unnormalized - x is some page id
// the original formula (now in pseudo code)
un(x) = 0.3*log(views(x)+10)/log(10+maxViews()) + 0.7*log(comments(x)+10)/log(10+maxComments())

The maximum will be 1.0. The minimum starts at 1.0 (when no page has any views or comments yet) and moves downwards as the site-wide maximums for views/comments grow.

We define un(0) as the minimum value, i.e. the value of the above formula when views(x) and comments(x) are both 0.

To get a normalized formula from 0.1 to 1.0, you then compute n(x), the normalized priority for page x

                  (1.0-un(x)) * (un(0)-0.1)
  n(x) = un(x) -  -------------------------    when un(0) != 1.0
                          1.0-un(0)

       = 0.1 otherwise.
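
A small Python sketch of the normalization, assuming the site-wide maximums are passed in as plain numbers. The names un and normalized mirror the pseudo code; the floor parameter and the math.isclose check for the un(0) = 1.0 case are choices made for this example.

```python
import math

def un(views, comments, max_views, max_comments):
    """Unnormalized priority, as in the pseudo code above."""
    return (0.3 * math.log(views + 10) / math.log(10 + max_views)
            + 0.7 * math.log(comments + 10) / math.log(10 + max_comments))

def normalized(views, comments, max_views, max_comments, floor=0.1):
    """Linearly rescale un(x) so that un(0) maps to floor and 1.0 stays 1.0."""
    u_x = un(views, comments, max_views, max_comments)
    u_0 = un(0, 0, max_views, max_comments)
    if math.isclose(u_0, 1.0):
        # No page has any views or comments yet, so every page gets the floor.
        return floor
    return u_x - (1.0 - u_x) * (u_0 - floor) / (1.0 - u_0)

# Site maxima are assumed to be 1000 views and 50 comments.
print(normalized(0, 0, 1000, 50))      # lowest possible page
print(normalized(1000, 50, 1000, 50))  # page holding both maxima
```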
mdma answered Sep 21 '22 03:09


Priority = W1 * views / maxViewsOfAllArticles + W2 * comments / maxCommentsOfAllArticles with W1+W2=1

Although IMHO, just use 0.5*log_10(10+views)/log_10(10+maxViews) + 0.5*log_10(10+comments)/log_10(10+maxComments)
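
To see why the log version may be preferable, the two formulas can be compared side by side. This is a rough Python sketch with made-up numbers; the function names are hypothetical, W1 = W2 = 0.5 is assumed for both, and the linear version assumes the maximums are greater than zero.

```python
import math

def linear_priority(views, comments, max_views, max_comments, w1=0.5, w2=0.5):
    """Linear formula: each count as a fraction of the site-wide maximum."""
    return w1 * views / max_views + w2 * comments / max_comments

def log_priority(views, comments, max_views, max_comments):
    """Log formula: compresses the gap between typical and outlier pages."""
    return (0.5 * math.log10(10 + views) / math.log10(10 + max_views)
            + 0.5 * math.log10(10 + comments) / math.log10(10 + max_comments))

# A moderately popular page on a site whose top page has
# 100000 views and 1000 comments (made-up numbers).
lin = linear_priority(1000, 10, 100000, 1000)
log = log_priority(1000, 10, 100000, 1000)
print(lin, log)
```

Under the linear formula a single runaway page pushes every other page towards zero, while the log formula spreads the remaining pages over more of the range.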

William Entriken answered Sep 19 '22 03:09


What you're looking for here is not an algorithm, but a formula.

Unfortunately, you haven't really specified the details of what you want, so there's no way we can provide the formula to you.

Instead, let's try to walk through the problem together.

You've got two incoming parameters, the viewCount and the commentCount. You want to return a single number, Priority. So far, so good.

You say that Priority should range between 0 and 1, but this isn't really important. If we were to come up with a formula we liked, but that resulted in values between 0 and N, we could just divide the results by N, so this constraint isn't really relevant.

Now, the first thing we need to decide is the relative weight of Comments vs Views.

If page A has 100 comments and 10 views, and page B has 10 comments and 100 views, which should have a higher priority? Or, should it be the same priority? You need to decide what's right for your definition of Priority.

If you decide, for example, that comments are 5 times more valuable than views, then we can begin with a formula like

 Priority = 5 * Comments + Views

Obviously, this can be generalized to

Priority = A * Comments + B * Views

Where A and B are relative weights.

But, sometimes we want our weights to be exponential instead of linear, like

 Priority = Comments ^ A + Views ^ B

which will give a very different curve than the earlier formula.

Similarly,

 Priority = Comments ^ A * Views ^ B

will give higher value to a page with 20 comments and 20 views than one with 1 comment and 40 views, if the weights are equal.
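
A quick way to get a feel for the three shapes is to run them on the two pages just mentioned. This is a Python sketch with hypothetical function names, using equal weights A = B = 1.

```python
def linear(comments, views, a, b):
    """Priority = A * Comments + B * Views"""
    return a * comments + b * views

def power_sum(comments, views, a, b):
    """Priority = Comments ^ A + Views ^ B"""
    return comments ** a + views ** b

def power_product(comments, views, a, b):
    """Priority = Comments ^ A * Views ^ B"""
    return comments ** a * views ** b

balanced = (20, 20)  # 20 comments, 20 views
lopsided = (1, 40)   # 1 comment, 40 views

for formula in (linear, power_sum, power_product):
    print(formula.__name__,
          formula(*balanced, 1, 1),
          formula(*lopsided, 1, 1))
```

With equal weights the linear form puts the lopsided page slightly ahead (41 vs 40), while the product form strongly favors the balanced page (400 vs 40).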

So, to summarize:

You really ought to make a spreadsheet with sample values for Views and Comments, and then play around with various formulas until you get one that has the distribution that you are hoping for.

We can't do it for you, because we don't know how you want to value things.

Michael Dorfman answered Sep 21 '22 03:09


I know it has been a while since this was asked, but I encountered a similar problem and had a different solution.

When you want to have a way to rank something, and there are multiple factors that you're using to perform that ranking, you're doing something called multi-criteria decision analysis (MCDA). See: http://en.wikipedia.org/wiki/Multi-criteria_decision_analysis

There are several ways to handle this. In your case, your criteria have different "units". One is in units of comments, the other is in units of views. Furthermore, you may want to give different weight to these criteria based on whatever business rules you come up with.

In that case, the best solution is something called a weighted product model. See: http://en.wikipedia.org/wiki/Weighted_product_model

The gist is that you take each of your criteria and turn it into a percentage (as was previously suggested), then you take that percentage and raise it to the power of X, where X is a number between 0 and 1. This number represents your weight. Your total weights should add up to one.

Lastly, you multiply each of the results together to come up with a rank. If the rank is greater than 1, then the numerator page has a higher rank than the denominator page.

Each page would be compared against every other page by doing something like:

  • p1C = page 1 comments
  • p1V = page 1 views
  • p2C = page 2 comments
  • p2V = page 2 views
  • wC = comment weight
  • wV = view weight

rank = (p1C/p2C)^(wC) * (p1V/p2V)^(wV)

The end result is a sorted list of pages according to their rank.

I've implemented this in C# by performing a sort on a collection of objects implementing IComparable.
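
The same idea can be sketched in Python with a comparison function instead of IComparable. Several things here are assumptions made for illustration: the page dictionaries, the 0.7/0.3 weights, and the +1 added to every count so that a page with zero comments or views does not cause a division by zero.

```python
from functools import cmp_to_key

W_COMMENTS = 0.7  # assumed weights; they should add up to 1
W_VIEWS = 0.3

def wpm_ratio(p1, p2):
    """rank = (p1C/p2C)^wC * (p1V/p2V)^wV, with +1 smoothing on each count."""
    return (((p1["comments"] + 1) / (p2["comments"] + 1)) ** W_COMMENTS
            * ((p1["views"] + 1) / (p2["views"] + 1)) ** W_VIEWS)

def compare_pages(p1, p2):
    """Order p1 before p2 when the ratio says p1 outranks p2."""
    ratio = wpm_ratio(p1, p2)
    if ratio > 1:
        return -1
    if ratio < 1:
        return 1
    return 0

# Made-up sample data.
pages = [
    {"id": "a", "comments": 5, "views": 1000},
    {"id": "b", "comments": 50, "views": 200},
    {"id": "c", "comments": 0, "views": 10},
]
ranked = sorted(pages, key=cmp_to_key(compare_pages))
print([page["id"] for page in ranked])
```

Since the ratio is greater than 1 exactly when page 1's own weighted product is larger than page 2's, sorting by each page's weighted product directly would give the same order with fewer comparisons.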

RMD answered Sep 21 '22 03:09