As the title reads, I have a problem with implementing a related articles algorithm. Let me start by listing the tables from the database:
[articles]
id_article
id_category
name
content
publish_date
is_deleted
[categories]
id_category
id_parent
name
[tags_to_articles]
id_tag
id_article
[tags]
id_tag
name
[articles_to_authors]
id_article
id_author
[authors]
id_author
name
is_deleted
[related_articles]
id_article_left
id_article_right
related_score
Every other table except related_articles has data in it. Now i want to fill related_articles with scores between articles (very important: the table will work as an oriented graph, the score of article A with article B could be different than the score between B and A, see the list). The score is computed like this:
I tried to make a query like this:
SELECT a.id, b.id, a.id_category, a.publish_date,
b.id_category, b.publish_date,
c.id_tag,
e.id_author
FROM `articles` a, articles b,
tags_to_articles c, tags_to_articles d,
articles_to_authors e, articles_to_authors f
WHERE a.id_article <> b.id_article AND
(
(a.id_article=c.id_article and c.id_tag=d.id_tag and d.id_article=b.id_article)
OR
(a.id=e.id_article and e.id_author=f.id_author and f.id_article=b.id_article)
OR
(a.id_category=b.id_category)
)
In theory, this would list every element worth computing for score. However, this takes way too much time and resources.
Is there another way? I'm also open to adjusting the algorithm or the tables if it gets a workable solution. Also worth noting is that the score calculations are done in a cron, of course I don't expect this to be running on every page request.
Steps Involved in Collaborative Filtering To build a system that can automatically recommend items to users based on the preferences of other users, the first step is to find similar users or items. The second step is to predict the ratings of the items that are not yet rated by a user.
The standard method of Collaborative Filtering is known as Nearest Neighborhood algorithm.
Memory-Based Collaborative Filtering approaches can be divided into two main sections: user-item filtering and item-item filtering. A user-item filtering takes a particular user, find users that are similar to that user based on similarity of ratings, and recommend items that those similar users liked.
For example, a collaborative filtering recommendation system for preferences in television programming could make predictions about which television show a user should like given a partial list of that user's tastes (likes or dislikes).
I seriously doubt you'd be able to do something like this with a single statement and get any kind of performance. Break it up into pieces. Use temp tables. Use set operations.
-- First, let's list all tables that share a category.
SELECT a1.id_article as 'left_article',
a2.id_article as 'right_article',
1 as 'score'
INTO #tempscore
FROM #articles a1
INNER JOIN #articles a2 ON
a1.id_category = a2.id_category
AND a1.id_article <> a2.id_article
-- Now, let's add up everything that shares an author
INSERT INTO #tempscore (left_article, right_article, score)
SELECT ata1.id_article,
ata2.id_article,
2
FROM #articles_to_authors ata1
INNER JOIN #articles_to_authors ata2 ON
ata1.id_author = ata2.id_author
-- Now, let's add up everything that shares a a tag
INSERT INTO #tempscore (left_article, right_article, score)
SELECT ata1.id_article,
ata2.id_article,
4
FROM #tags_to_articles ata1
INNER JOIN #tags_to_articles ata2 ON
ata1.id_tag = ata2.id_tag
-- We haven't looked at dates, yet, but let's go ahead and consolidate what we know.
SELECT left_article as 'left_article',
right_article as 'right_article',
SUM (score) as 'total_score'
INTO #cscore
FROM #tempscore
GROUP BY left_article,
right_article
-- Clean up some extranneous stuff
DELETE FROM #cscore WHERE left_article = right_article
-- Now we need to deal with dates
SELECT DateDiff (Day, art1.publish_date, art2.publish_date) as 'datescore',
art1.id_article as 'left_article',
art2.publish_date as 'right_article'
INTO #datescore
FROM #cscore
INNER JOIN #articles art1 ON
#cscore.left_article = art1.id_article
INNER JOIN #articles art2 ON
#cscore.right_article = art2.id_article
WHERE art1.publish_date > art2.publish_date
-- And finally, put it all together
INSERT INTO #related_articles (id_article_left, id_article_right, related_score)
SELECT s1.left_article,
s1.right_article,
s1.total_score + IsNull (s2.datescore, 0)
FROM #cscore s1
LEFT JOIN #datescore s2 ON
s1.left_article = s2.left_article
AND s1.right_article = s2.right_article
In my testing, the scores appeared to come out right, but I don't have any real sample data to go off of so I can't be sure. If nothing else, this should give you a basis to start from.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With