Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching similar entities based on many to many relationship

I have two entities in my database that are connected with a many to many relationship. I was wondering what would be the best way to list which entities have the most similarities based on it?

I tried doing a count(*) with intersect, but the query takes too long to run on every entry in my database (there are about 20k records). When running the query I wrote, CPU usage jumps to 100% and the database has locking issues.

Here is some code showing what I've tried:

My tables look something along these lines:

/* 20k records */
create table Movie(
   Id INT PRIMARY KEY,
   Title varchar(255)
);

/* 200-300 records */
create table Tags(
   Id INT PRIMARY KEY,
   Desc varchar(255)
);

/* 200,000-300,000 records */
create table TagMovies(
    Movie_Id INT,
    Tag_Id INT,
    PRIMARY KEY (Movie_Id, Tag_Id),
    FOREIGN KEY (Movie_Id) REFERENCES Movie(Id),
    FOREIGN KEY (Tag_Id) REFERENCES Tags(Id),
);

(This works, but it is terribly slow) This is the query that I wrote to try and list them: Usually I also filter with top 1 & add a where clause to get a specific set of related data.

SELECT 
    bk.Id,
    rh.Id
FROM
    Movies bk
    CROSS APPLY (
        SELECT TOP 15
           b.Id,
           /* Tags Score */
           (
            SELECT COUNT(*) FROM (
                SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = bk.Id
                INTERSECT
                SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = b.Id
                ) Q1
           )
           as Amount
        FROM 
            Movies b 
        WHERE 
            b.Id <> bk.Id
        ORDER BY Amount DESC
    ) rh

Explanation: Movies have tags and the user can get try to find movies similar to the one that they selected based on other movies that have similar tags.

like image 202
newb Avatar asked Mar 15 '16 19:03

newb


2 Answers

Hmm ... just an idea, but maybe I didnt understand ... This query should return best matched movies by tags for a given movie ID:

SELECT m.id, m.title, GROUP_CONCAT(DISTINCT t.Descr SEPARATOR ', ') as tags, count(*) as matches
FROM stack.Movie m 
LEFT JOIN stack.TagMovies tm ON m.Id = tm.Movie_Id
LEFT JOIN stack.Tags t ON tm.Tag_Id = t.Id
WHERE m.id != 1
AND tm.Tag_Id IN (SELECT Tag_Id FROM stack.TagMovies tm WHERE tm.Movie_Id = 1)
GROUP BY m.id
ORDER BY matches DESC
LIMIT 15;

EDIT: I just realized that it's for M$ SQL ... but maybe something similar can be done...

like image 166
barat Avatar answered Oct 10 '22 15:10

barat


You should probably decide on a naming convention and stick with it. Are tables singular or plural nouns? I don't want to get into that debate, but pick one or the other.

Without access to your database I don't know how this will perform. It's just off the top of my head. You could also limit this by the M.id value to find the best matches for a single movie, which I think would improve performance by quite a bit.

Also, TOP x should let you get the x closest matches.

SELECT
    M.id,
    M.title,
    SM.id AS similar_movie_id,
    SM.title AS similar_movie_title,
    COUNT(*) AS matched_tags
FROM
    Movie M
INNER JOIN TagsMovie TM1 ON TM1.movie_id = M.movie_id
INNER JOIN TagsMovie TM2 ON
    TM2.tag_id = TM1.tag_id AND
    TM2.movie_id <> TM1.movie_id
INNER JOIN Movie SM ON SM.movie_id = TM2.movie_id
GROUP BY
    M.id,
    M.title,
    SM.id AS similar_movie_id,
    SM.title AS similar_movie_title
ORDER BY
    COUNT(*) DESC
like image 44
Tom H Avatar answered Oct 10 '22 14:10

Tom H