I'm trying to find pairs of users that enjoy the same set of TV shows in this simplified example
Let's say I have a table where each user gets an entry for each TV Show that they enjoy:
|USER | Show |
|-----|-------------|
|001 | Lost |
|001 | South Park |
|002 | Lost |
|003 | Lost |
|003 | South Park |
|004 | South Park |
|005 | Lost |
|006 | Lost |
Then I would like a result of:
|USER1 |USER2 |
|------|------|
|001 |003 |
|003 |001 |
|002 |005 |
|002 |006 |
|005 |002 |
|005 |006 |
|006 |002 |
|006 |005 |
Or an even better version would be:
|USER1 |USER2 |
|------|------|
|001 |003 |
|002 |005 |
|002 |006 |
|005 |006 |
Which basically says: User 1 likes the same set of shows as User 3.
I've been playing around with GROUP BY and JOIN, but I still can't find the answer :(.
So far, I've found that using
SELECT s1.User as USER1, s2.User as USER2, s1.Show as Show
FROM Shows s1 JOIN (SELECT * FROM Shows) s2
ON s1.Shows=s2.Shows AND s1.User!=s2.User;
Which yields pairs of users and the Show they have in common. But I don't know where to go from here.
If you can accept CSV instead of tabulated results, you could simply group the table twice:
SELECT GROUP_CONCAT(User) FROM (
SELECT User, GROUP_CONCAT(DISTINCT `Show` ORDER BY `Show` SEPARATOR 0x1e) AS s
FROM Shows
GROUP BY User
) t GROUP BY s
Otherwise, you can join the above subquery to itself:
SELECT DISTINCT LEAST(t.User, u.User) AS User1,
GREATEST(t.User, u.User) AS User2
FROM (
SELECT User, GROUP_CONCAT(DISTINCT `Show` ORDER BY `Show` SEPARATOR 0x1e) AS s
FROM Shows
GROUP BY User
) t JOIN (
SELECT User, GROUP_CONCAT(DISTINCT `Show` ORDER BY `Show` SEPARATOR 0x1e) AS s
FROM Shows
GROUP BY User
) u USING (s)
WHERE t.User <> u.User
See them on sqlfiddle.
Of course, if duplicate (User, Show)
pairs are guaranteed not to exist in the Shows
table, you could improve performance by removing the DISTINCT
keyword from the GROUP_CONCAT()
aggregations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With