
Compare group of tags to find similarity/score with PHP/MySQL

Tags: php, mysql, tags

How can I compare a group of tags to another post's tags in my database to get related posts?

What I'm trying to do is compare the group of tags on one post to another post's tags as a whole, not each tag individually. Say you wanted to get truly related items based on a post's tags and then show them from most related to least related. Three related items must always be shown, no matter how weak the relationship.

Post A has the tags: "architecture", "wood", "modern", "switzerland"
Post B has the tags: "architecture", "wood", "modern"
Post C has the tags: "architecture", "modern", "stone"
Post D has the tags: "architecture", "house", "residence"

Post B is related to post A by 75% (3 related tags)
Post C is related to post A by 50% (2 related tags)
Post D is related to post A by 25% (1 related tag)
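
The percentages above are the number of tags shared with post A divided by post A's own tag count. A minimal Python sketch of that arithmetic on the four example posts (the names `posts` and `relatedness` are made up for illustration):

```python
# Tags for the four example posts from the question.
posts = {
    "A": {"architecture", "wood", "modern", "switzerland"},
    "B": {"architecture", "wood", "modern"},
    "C": {"architecture", "modern", "stone"},
    "D": {"architecture", "house", "residence"},
}

def relatedness(source, other):
    """Percentage of the source post's tags that the other post shares."""
    return 100 * len(source & other) / len(source)

# Score every other post against post A.
scores = {name: relatedness(posts["A"], tags)
          for name, tags in posts.items() if name != "A"}

# Print the most related post first.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"Post {name}: {score:.0f}%")
```

This prints B at 75%, C at 50% and D at 25%, matching the figures listed above.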

How can I do that? I'm currently using three tables.

posts
> id
> image
> date

post_tags
> post_id
> tag_id

tags
> id
> name

I have searched the Internet and Stack Overflow to find out how to do this. The closest thing I found was the question 'How to find "related items" in PHP', but it didn't solve much for me.

Asked by stwhite on Aug 10 '10 at 05:08



1 Answer

NOTE: This solution is MySQL only, as MySQL has its own interpretation of GROUP BY.

I've also used my own similarity calculation: the number of identical tags divided by the average tag count of post A and post B. So if post A has 4 tags and post B has 2 tags, both of which are shared with A, the similarity is about 67%.

(SHARED:2) / ((A:4 + B:2) / 2), or (SHARED:2) / (AVG:3)

It should be easy to change the formula if you want/need to...
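
As a quick check of the formula outside the database, here is a minimal Python sketch. The helper name `similarity` is hypothetical, and the tag names are placeholders borrowed from the question, since the example above only gives tag counts:

```python
def similarity(tags_a, tags_b):
    """Shared tag count divided by the average tag count of the two posts."""
    shared = len(tags_a & tags_b)
    average = (len(tags_a) + len(tags_b)) / 2
    # Two posts without any tags have nothing in common.
    return shared / average if average else 0.0

# Post A has 4 tags, post B has 2, and both of B's tags are shared with A.
a = {"architecture", "wood", "modern", "switzerland"}
b = {"architecture", "wood"}
print(round(similarity(a, b), 3))  # 2 / 3, rounded to 0.667
```

Swapping in a different formula, for example dividing by post A's tag count alone, only requires changing the `average` line.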

SELECT
 sourcePost.id AS source_id,
 targetPost.id AS target_id,

 /* COUNT NUMBER OF IDENTICAL TAGS */
 /* REF GROUPING OF sourcePost.id and targetPost.id BELOW */
 COUNT(targetPost.id) /
 (
  (
   /* TOTAL TAGS IN SOURCE POST */
   (SELECT COUNT(*) FROM post_tags WHERE post_id = sourcePost.id)

   +

   /* TOTAL TAGS IN TARGET POST */
   (SELECT COUNT(*) FROM post_tags WHERE post_id = targetPost.id)

  ) / 2  /* AVERAGE TAGS IN SOURCE + TARGET */
 ) as similarity
FROM
 posts sourcePost
LEFT JOIN
 post_tags sourcePostTags ON (sourcePost.id = sourcePostTags.post_id)
INNER JOIN
 post_tags targetPostTags ON (sourcePostTags.tag_id = targetPostTags.tag_id
                             AND 
                              sourcePostTags.post_id != targetPostTags.post_id)
LEFT JOIN
 posts targetPost ON (targetPostTags.post_id = targetPost.id)
GROUP BY
 sourcePost.id, targetPost.id
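
To get the three most related posts for one post, as the question asks, a query of this shape could be restricted to a single source post, ordered by similarity and limited to three rows. The sketch below tries that on the question's example data. Several things in it are assumptions: it uses Python's built-in `sqlite3` module rather than MySQL so that it runs without a server, it writes the division as `2.0 * COUNT(*) / (...)` because SQLite would otherwise do integer division, it joins `post_tags` to itself and leaves out the `posts` joins, and it stores posts A to D under the made-up ids 1 to 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, image TEXT, date TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post_tags (post_id INTEGER, tag_id INTEGER);
""")

# Example posts A-D from the question, stored under the assumed ids 1-4.
example = {
    1: ["architecture", "wood", "modern", "switzerland"],
    2: ["architecture", "wood", "modern"],
    3: ["architecture", "modern", "stone"],
    4: ["architecture", "house", "residence"],
}
tag_ids = {}
for post_id, names in example.items():
    conn.execute("INSERT INTO posts (id) VALUES (?)", (post_id,))
    for name in names:
        if name not in tag_ids:
            cur = conn.execute("INSERT INTO tags (name) VALUES (?)", (name,))
            tag_ids[name] = cur.lastrowid
        conn.execute("INSERT INTO post_tags VALUES (?, ?)", (post_id, tag_ids[name]))

# Shared tags divided by the average tag count, for one source post only.
query = """
    SELECT
        target.post_id AS target_id,
        2.0 * COUNT(*) / (
            (SELECT COUNT(*) FROM post_tags WHERE post_id = source.post_id)
            + (SELECT COUNT(*) FROM post_tags WHERE post_id = target.post_id)
        ) AS similarity
    FROM post_tags AS source
    INNER JOIN post_tags AS target
        ON source.tag_id = target.tag_id AND source.post_id != target.post_id
    WHERE source.post_id = ?
    GROUP BY source.post_id, target.post_id
    ORDER BY similarity DESC
    LIMIT 3
"""
related = conn.execute(query, (1,)).fetchall()
for target_id, score in related:
    print(target_id, round(score, 3))
```

For post A this lists post B first (3 shared tags over an average of 3.5), then C, then D. Posts that share no tag with the source never appear in the result, so fewer than three rows can come back; filling the remaining slots would need a separate fallback.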
Answered by Ivar Bonsaksen on Nov 11 '22 at 03:11
