I'd really appreciate it if someone could validate my SQL query.
For the following dataset:
MD5 UserPK CategoryPK
ADCDE 1 7
ADCDE 1 4
ADCDE 1 7
dffrf 1 7
dffrf 2 7
dffrf 2 6
dffrf 1 1
I'd like to select MD5 and CategoryPK where two or more rows exist with identical MD5 values, identical CatgegoryPK and two or more DIFFERENT UserPK values.
In other words, I'd like to know the MD5 and categoryPK of all records where two or more different users (UserPK) have assigned the same category (UserPK) to the same file (Md5). I'm not interested in records the same user has assigned the category to multiple times, (unless a different user has also assigned the same category to that file).
So from the above data, I would like to be returned just:
md5 CategoryPK
dffrf 7
The query I've written is:
SELECT md5,
count(md5),
count(distinct categorypk) as cntcat,
count(distinct userpk) as cntpk
FROM Hash
group by md5 having count(md5) > 1
and cntpk > 1
and cntcat = 1;
It seems to work, but before I start using it in anger, I'd appreciate a second opinion in case I've missed something or if there is a better way of doing it.
Thanks
I don't think your code will give you what you're after; what happens when a file has been assigned more than one category by multiple users, with some categories overlapping? Then cntcat != 1
, so your HAVING
clause will fail to match even though the file has indeed been categorised the same way by multiple users.
I would instead use a self-join:
SELECT a.MD5, a.CategoryPK
FROM Hash a
JOIN Hash b
ON a.MD5 = b.MD5
AND a.UserPK <> b.UserPK
AND a.CategoryPK = b.CategoryPK
GROUP BY a.MD5, a.CategoryPK
HAVING COUNT(DISTINCT a.UserPK) > 2 -- you said "more than 2" ?
I can't see any problems with what you have written apart from you are not getting the category in your select list which appears to be in the criteria? I think you could simplify it slightly and get the category out:
SELECT MD5, CategoryPK
FROM Hash
GROUP BY MD5, CategoryPK
HAVING MIN(UserPK) <> MAX(UserPK)
Alternatively, you could look at solving this with a join, you may need to run a few tests and use EXPLAIN, but sometimes joins perform better than GROUP BY. It is worth trying anyway to see if you see any significant difference.
SELECT DISTINCT t1.MDF, t2.CategoryPK
FROM Hash T1
INNER JOIN Hash T2
ON T1.MD5 = T2.MD5
AND T1.CategoryPK = T2.CategoryPK
AND T1.UserPK < T2.UserPK
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With