Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm that searches for related items based on common tags

Lets take StackOverflow questions as example. Each of them has multiple tags assigned. How to build an algorithm that would find related questions based on how many common tags they have (sorted by number of common tags)?

For now I can't think about anything better than just selecting all questions that have at least one common tag into an array and then looping through them all assigning number of common tags to each item, then sorting this array.

Is there more clever way of doing it? Perfect solution would be a single sql query.

like image 605
serg Avatar asked Oct 12 '09 19:10

serg


2 Answers

This could be as bad as O(n^2), but it works:

create table QuestionTags (questionid int, tag int);

select q1.questionid, q2.questionid, count(*) as commontags
from QuestionTags q1 join QuestionTags q2 
where q1.tag = q2.tag and q1.questionid < q2.questionid
group by q1.questionid, q2.questionid order by commontags desc;
like image 59
Keith Randall Avatar answered Nov 11 '22 09:11

Keith Randall


I didn't have time to optimize the WHERE IN() clause which is slow as death. I also didn't both to include indexes or specify the engine or collations but this should hopefully do the trick. Please point out any mistakes as I have not actually tested this sloppy code:

CREATE TABLE question (
    id INT(11) UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    question MEDIUMTEXT
);

CREATE TABLE question_rel_tag (
    id INT(11) UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    question_id INT(11) UNSIGNED NOT NULL,
    tag_id INT(11) UNSIGNED NOT NULL
);

CREATE TABLE tag (
    id INT(11) UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(20) NOT NULL
);

SELECT question.id, question.title, question.question, count(tag.id) AS `count`
FROM question
INNER JOIN question_rel_tag ON (question.id = question_rel_tag.question_id)
INNER JOIN tag ON (question_rel_tag.tag_id = tag.id)
WHERE question.id != YOUR_QUESTION_ID_HERE
AND tag.id IN 
    (SELECT tag.id FROM question
    INNER JOIN question_rel_tag ON (question.id = question_rel_tag.question_id)
    INNER JOIN tag ON (question_rel_tag.tag_id = tag.id)
    WHERE question.id = YOUR_QUESTION_ID_HERE)
GROUP BY tag.id
ORDER BY `count` DESC
like image 34
Corey Ballou Avatar answered Nov 11 '22 09:11

Corey Ballou