Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement a related articles algorithm using this form of collaborative filtering

Tags:

sql

mysql

As the title reads, I have a problem with implementing a related articles algorithm. Let me start by listing the tables from the database:

[articles]
id_article
id_category
name
content
publish_date
is_deleted

[categories]
id_category
id_parent
name

[tags_to_articles]
id_tag
id_article

[tags]
id_tag
name

[articles_to_authors]
id_article
id_author

[authors]
id_author
name
is_deleted

[related_articles]
id_article_left
id_article_right
related_score

The algorithm

Every other table except related_articles has data in it. Now i want to fill related_articles with scores between articles (very important: the table will work as an oriented graph, the score of article A with article B could be different than the score between B and A, see the list). The score is computed like this:

  • if the two articles in question have the same category, a number(x) is added to the score
  • for every author they have in common, a number(y) is added to the score
  • for every tag they have in common, a number(z) is added to the score
  • if we compute the score of article A with article B, the difference between now() and the publish_date of article B will generate a number(t) that will be subtracted from the score

My first (inefficient) approach

I tried to make a query like this:

SELECT a.id, b.id, a.id_category, a.publish_date,
    b.id_category, b.publish_date,
    c.id_tag,
    e.id_author
FROM `articles` a, articles b, 
        tags_to_articles c, tags_to_articles d,
        articles_to_authors e, articles_to_authors f
WHERE a.id_article <> b.id_article AND 
(
    (a.id_article=c.id_article and c.id_tag=d.id_tag and d.id_article=b.id_article)
    OR
    (a.id=e.id_article and e.id_author=f.id_author and f.id_article=b.id_article)
    OR
    (a.id_category=b.id_category)
)

In theory, this would list every element worth computing for score. However, this takes way too much time and resources.

Is there another way? I'm also open to adjusting the algorithm or the tables if it gets a workable solution. Also worth noting is that the score calculations are done in a cron, of course I don't expect this to be running on every page request.

like image 317
Valentin Brasso Avatar asked Nov 30 '11 08:11

Valentin Brasso


People also ask

How do you implement collaborative filtering?

Steps Involved in Collaborative Filtering To build a system that can automatically recommend items to users based on the preferences of other users, the first step is to find similar users or items. The second step is to predict the ratings of the items that are not yet rated by a user.

What algorithm is used in collaborative filtering?

The standard method of Collaborative Filtering is known as Nearest Neighborhood algorithm.

What are the main implementation strategies of collaborative filtering model?

Memory-Based Collaborative Filtering approaches can be divided into two main sections: user-item filtering and item-item filtering. A user-item filtering takes a particular user, find users that are similar to that user based on similarity of ratings, and recommend items that those similar users liked.

What is collaborative filtering and give an example of how it is used?

For example, a collaborative filtering recommendation system for preferences in television programming could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes).


1 Answers

I seriously doubt you'd be able to do something like this with a single statement and get any kind of performance. Break it up into pieces. Use temp tables. Use set operations.

-- First, let's list all tables that share a category.
SELECT   a1.id_article as 'left_article',
         a2.id_article as 'right_article',
         1 as 'score'
INTO     #tempscore
FROM     #articles a1
   INNER JOIN #articles a2 ON
         a1.id_category = a2.id_category
     AND a1.id_article <> a2.id_article

-- Now, let's add up everything that shares an author
INSERT INTO #tempscore (left_article, right_article, score)
SELECT   ata1.id_article,
         ata2.id_article,
         2
FROM     #articles_to_authors ata1
   INNER JOIN #articles_to_authors ata2 ON
         ata1.id_author = ata2.id_author

-- Now, let's add up everything that shares a a tag
INSERT INTO #tempscore (left_article, right_article, score)
SELECT   ata1.id_article,
         ata2.id_article,
         4
FROM     #tags_to_articles ata1
   INNER JOIN #tags_to_articles ata2 ON
         ata1.id_tag = ata2.id_tag

-- We haven't looked at dates, yet, but let's go ahead and consolidate what we know.
SELECT   left_article as 'left_article',
         right_article as 'right_article',
         SUM (score) as 'total_score'
INTO     #cscore
FROM     #tempscore
GROUP BY left_article,
         right_article

-- Clean up some extranneous stuff
DELETE FROM #cscore WHERE left_article = right_article

-- Now we need to deal with dates
SELECT   DateDiff (Day, art1.publish_date, art2.publish_date) as 'datescore',
         art1.id_article as 'left_article',
         art2.publish_date as 'right_article'
INTO     #datescore
FROM     #cscore
   INNER JOIN #articles art1 ON
         #cscore.left_article = art1.id_article
   INNER JOIN #articles art2 ON
         #cscore.right_article = art2.id_article
WHERE    art1.publish_date > art2.publish_date

-- And finally, put it all together
INSERT INTO #related_articles (id_article_left, id_article_right, related_score)
SELECT   s1.left_article,
         s1.right_article,
         s1.total_score + IsNull (s2.datescore, 0)
FROM     #cscore s1
   LEFT  JOIN #datescore s2 ON
         s1.left_article = s2.left_article
     AND s1.right_article = s2.right_article

In my testing, the scores appeared to come out right, but I don't have any real sample data to go off of so I can't be sure. If nothing else, this should give you a basis to start from.

like image 186
Jason 'Bug' Fenter Avatar answered Oct 26 '22 19:10

Jason 'Bug' Fenter