I'm having an issue with SQL joins in a query that joins the Post table to the Comment, Click and Vote tables and returns stats about each post's activity. The query below is what I've been using.
SELECT
p.PostID,
p.Title,
CASE
WHEN COUNT(cm.CommentID) IS NULL THEN 0
ELSE COUNT(cm.CommentID)
END AS CommentCount,
CASE
WHEN COUNT(cl.ClickID) IS NULL THEN 0
ELSE COUNT(cl.ClickID)
END AS ClickCount,
CASE
WHEN SUM(vt.Value) IS NULL THEN 0
ELSE SUM(vt.Value)
END AS VoteScore
FROM
Post p
LEFT OUTER JOIN Comment cm ON p.PostID = cm.PostID
LEFT OUTER JOIN Click cl ON p.PostID = cl.PostID
LEFT OUTER JOIN Vote vt ON p.PostID = vt.PostID
GROUP BY
p.PostID,
p.Title
It yields the following result:
| PostID | CommentCount | ClickCount | VoteScore |
|--------|--------------|------------|-----------|
| 41     | 60           | 60         | 60        |
| 50     | 1683         | 1683       | 1683      |
This, I know, isn't correct. When I comment out all but one of the joins:
SELECT
p.PostID
,p.Title
,CASE
WHEN COUNT(cm.CommentID) IS NULL THEN 0
ELSE COUNT(cm.CommentID)
END AS CommentCount
/*
,CASE
WHEN COUNT(cl.ClickID) IS NULL THEN 0
ELSE COUNT(cl.ClickID)
END AS ClickCount
,CASE
WHEN SUM(vt.Value) IS NULL THEN 0
ELSE SUM(vt.Value)
END AS VoteScore
*/
FROM
Post p
LEFT OUTER JOIN Comment cm ON p.PostID = cm.PostID
/*
LEFT OUTER JOIN Click cl ON p.PostID = cl.PostID
LEFT OUTER JOIN Vote vt ON p.PostID = vt.PostID
*/
GROUP BY
p.PostID,
p.Title
I get
| PostID | CommentCount |
|--------|--------------|
| 41     | 3            |
Which is correct. Any ideas what I've done wrong?
Thanks.
The result that is being returned is expected because the query is producing a Cartesian (or semi-Cartesian) product. The query is basically telling MySQL to perform "cross join" operations on the rows returned from comment, click and vote.
Each row returned from comment (for a given postid) gets matched to each row from click (for the same postid). And then each of the rows in that result gets matched to each row from vote (for the same postid).
So, for two rows from comment, three rows from click and four rows from vote, that will return a total of 24 (=2x3x4) rows.
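To see the multiplication concretely, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The schema (lower-case table names, a voteid column) and the sample rows are assumptions made up for illustration; only the join pattern comes from the question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema and data are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post    (postid INTEGER PRIMARY KEY);
    CREATE TABLE comment (commentid INTEGER PRIMARY KEY, postid INTEGER);
    CREATE TABLE click   (clickid INTEGER PRIMARY KEY, postid INTEGER);
    CREATE TABLE vote    (voteid INTEGER PRIMARY KEY, postid INTEGER, value INTEGER);
    INSERT INTO post (postid) VALUES (1);
    INSERT INTO comment (postid) VALUES (1), (1);                            -- 2 comments
    INSERT INTO click (postid) VALUES (1), (1), (1);                         -- 3 clicks
    INSERT INTO vote (postid, value) VALUES (1, 1), (1, 1), (1, 1), (1, 1);  -- 4 votes
""")

# Same join shape as the question: three child tables joined to one post.
joined_rows, inflated_comment_count = conn.execute("""
    SELECT COUNT(*)
         , COUNT(cm.commentid)
      FROM post p
      LEFT JOIN comment cm ON cm.postid = p.postid
      LEFT JOIN click   cl ON cl.postid = p.postid
      LEFT JOIN vote    vt ON vt.postid = p.postid
""").fetchone()

print(joined_rows)             # 24 rows, i.e. 2 x 3 x 4
print(inflated_comment_count)  # 24, although the post has only 2 comments
```

Every aggregate computed over that joined set sees the same 24 rows, which is why all three counts in the question come out inflated.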
The usual pattern for fixing this is to avoid the cross join operations.
There are a couple of approaches to do that.
correlated subqueries in select list
If you only need a single aggregate (e.g. COUNT or SUM) from each of the three tables, you could remove the joins, and use correlated subqueries in the SELECT list. Write a query that gets a count for a single postid, for example
SELECT COUNT(1)
FROM comment cmt
WHERE cmt.postid = ?
Then wrap that query in parens, reference it in the SELECT list of another query, and replace the question mark with a reference to postid from the table in the outer query.
SELECT p.postid
, ( SELECT COUNT(1)
FROM comment cmt
WHERE cmt.postid = p.postid
) AS comment_count
FROM post p
Repeat the same pattern to get "counts" from click and vote.
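Putting the three subqueries together might look like the sketch below. It runs against an in-memory SQLite database through Python's sqlite3 module rather than MySQL, and the schema, the aliases clk and vt, and the sample rows are assumptions. The COALESCE around SUM is also an addition: SUM returns NULL when a post has no votes, whereas COUNT already returns 0.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema and data are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post    (postid INTEGER PRIMARY KEY);
    CREATE TABLE comment (commentid INTEGER PRIMARY KEY, postid INTEGER);
    CREATE TABLE click   (clickid INTEGER PRIMARY KEY, postid INTEGER);
    CREATE TABLE vote    (voteid INTEGER PRIMARY KEY, postid INTEGER, value INTEGER);
    INSERT INTO post (postid) VALUES (1), (2);          -- post 2 has no activity
    INSERT INTO comment (postid) VALUES (1), (1);
    INSERT INTO click (postid) VALUES (1), (1), (1);
    INSERT INTO vote (postid, value) VALUES (1, 1), (1, 1), (1, -1), (1, 1);
""")

# One correlated subquery per child table; no joins, so no row multiplication.
stats = conn.execute("""
    SELECT p.postid
         , ( SELECT COUNT(1)
               FROM comment cmt
              WHERE cmt.postid = p.postid
           ) AS comment_count
         , ( SELECT COUNT(1)
               FROM click clk
              WHERE clk.postid = p.postid
           ) AS click_count
         , ( SELECT COALESCE(SUM(vt.value), 0)
               FROM vote vt
              WHERE vt.postid = p.postid
           ) AS vote_score
      FROM post p
     ORDER BY p.postid
""").fetchall()

print(stats)  # [(1, 2, 3, 2), (2, 0, 0, 0)]
```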
The downside of this approach is that the subquery in the SELECT list will get executed for each row returned by the outer query. So this can get expensive if the outer query returns a lot of rows. If comment is a large table, then to get reasonable performance, it's critical that there's an appropriate index available on comment.
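A suitable index here is one with postid as its leading column. The sketch below creates one and runs the correlated subquery against it; it uses SQLite through Python's sqlite3 module as a stand-in for MySQL, and the index name comment_postid_ix, the schema and the data are all assumptions. The CREATE INDEX statement itself uses syntax that MySQL also accepts.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema, data and index name are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post    (postid INTEGER PRIMARY KEY);
    CREATE TABLE comment (commentid INTEGER PRIMARY KEY, postid INTEGER);
    INSERT INTO post (postid) VALUES (1), (2);
    INSERT INTO comment (postid) VALUES (1), (1), (2);

    -- postid is the leading column, so each execution of the subquery can
    -- seek to the matching rows instead of scanning the whole table.
    CREATE INDEX comment_postid_ix ON comment (postid);
""")

# List the indexes that now exist on comment.
index_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'comment'"
)]

counts = conn.execute("""
    SELECT p.postid
         , ( SELECT COUNT(1)
               FROM comment cmt
              WHERE cmt.postid = p.postid
           ) AS comment_count
      FROM post p
     ORDER BY p.postid
""").fetchall()

print(index_names)  # ['comment_postid_ix']
print(counts)       # [(1, 2), (2, 1)]
```

On MySQL, running EXPLAIN on the outer query is the usual way to check whether the index is actually being used.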
pre-aggregate in inline views
Another approach is to "pre-aggregate" the results in inline views. Write a query that returns the comment count for each postid. For example
SELECT cmt.postid
, COUNT(1)
FROM comment cmt
GROUP BY cmt.postid
Wrap that query in parens, reference it in the FROM clause of another query, and assign it an alias. The inline view query basically takes the place of a table in the outer query.
SELECT p.postid
, cm.postid
, cm.comment_count
FROM post p
LEFT
JOIN ( SELECT cmt.postid
, COUNT(1) AS comment_count
FROM comment cmt
GROUP BY cmt.postid
) cm
ON cm.postid = p.postid
And repeat that same pattern for click and vote. The trick here is the GROUP BY clause in the inline view query, which guarantees that it won't return any duplicate postid values. Since each postid appears at most once in the inline view, joining to it can't multiply the rows from post.
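With all three inline views in place, the query might look like the following sketch. It is run here against in-memory SQLite via Python's sqlite3 module rather than MySQL; the schema, the aliases and the sample data are assumptions, and the COALESCE calls are an addition that turns the NULLs produced by the LEFT JOIN (for posts with no activity) into zeros.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema and data are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post    (postid INTEGER PRIMARY KEY);
    CREATE TABLE comment (commentid INTEGER PRIMARY KEY, postid INTEGER);
    CREATE TABLE click   (clickid INTEGER PRIMARY KEY, postid INTEGER);
    CREATE TABLE vote    (voteid INTEGER PRIMARY KEY, postid INTEGER, value INTEGER);
    INSERT INTO post (postid) VALUES (1), (2);          -- post 2 has no activity
    INSERT INTO comment (postid) VALUES (1), (1);
    INSERT INTO click (postid) VALUES (1), (1), (1);
    INSERT INTO vote (postid, value) VALUES (1, 1), (1, 1), (1, -1), (1, 1);
""")

# Each inline view returns at most one row per postid, so the joins
# cannot multiply the rows coming from post.
stats = conn.execute("""
    SELECT p.postid
         , COALESCE(cm.comment_count, 0) AS comment_count
         , COALESCE(cl.click_count, 0)   AS click_count
         , COALESCE(vt.vote_score, 0)    AS vote_score
      FROM post p
      LEFT
      JOIN ( SELECT cmt.postid
                  , COUNT(1) AS comment_count
               FROM comment cmt
              GROUP BY cmt.postid
           ) cm
        ON cm.postid = p.postid
      LEFT
      JOIN ( SELECT clk.postid
                  , COUNT(1) AS click_count
               FROM click clk
              GROUP BY clk.postid
           ) cl
        ON cl.postid = p.postid
      LEFT
      JOIN ( SELECT v.postid
                  , SUM(v.value) AS vote_score
               FROM vote v
              GROUP BY v.postid
           ) vt
        ON vt.postid = p.postid
     ORDER BY p.postid
""").fetchall()

print(stats)  # [(1, 2, 3, 2), (2, 0, 0, 0)]
```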
The downside of this approach is that the derived table won't be indexed. So for a large number of postid values, it may be expensive to perform the join in the outer query. (More recent versions of MySQL partially address this downside, by automatically creating an appropriate index.)
(We can work around this limitation by creating a temporary table with an appropriate index. But this approach requires additional SQL statements, and is not entirely suitable for an ad hoc single statement. For batch processing of large sets, though, the additional complexity can be worth it for some significant performance gains.)
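A rough sketch of the temporary-table variant is shown below, for the comment counts only. It uses SQLite through Python's sqlite3 module in place of MySQL; the table name tmp_comment_count, the schema and the data are assumptions. Declaring postid as the primary key is one way to get the index, and the same three statements are written in syntax that MySQL also accepts.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema, data and names are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post    (postid INTEGER PRIMARY KEY);
    CREATE TABLE comment (commentid INTEGER PRIMARY KEY, postid INTEGER);
    INSERT INTO post (postid) VALUES (1), (2);          -- post 2 has no comments
    INSERT INTO comment (postid) VALUES (1), (1);

    -- Statement 1: a temporary table whose primary key is the join column.
    CREATE TEMPORARY TABLE tmp_comment_count
    ( postid        INTEGER PRIMARY KEY
    , comment_count INTEGER NOT NULL
    );

    -- Statement 2: fill it with the pre-aggregated counts.
    INSERT INTO tmp_comment_count (postid, comment_count)
    SELECT cmt.postid
         , COUNT(1)
      FROM comment cmt
     GROUP BY cmt.postid;
""")

# Statement 3: join to the indexed temporary table instead of a derived table.
counts = conn.execute("""
    SELECT p.postid
         , COALESCE(t.comment_count, 0) AS comment_count
      FROM post p
      LEFT
      JOIN tmp_comment_count t
        ON t.postid = p.postid
     ORDER BY p.postid
""").fetchall()

print(counts)  # [(1, 2), (2, 0)]
```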
collapse Cartesian product by DISTINCT values
As an entirely different approach, leave your query like it is, with the cross join operations, and allow MySQL to produce the Cartesian product. Then the aggregates in the SELECT list can filter out the duplicates. This requires that you have a column (or expression produced) from comment that is UNIQUE for each row in comment for a given postid.
SELECT p.postid
, COUNT(DISTINCT c.id) AS comment_count
FROM post p
LEFT
JOIN comment c
ON c.postid = p.postid
GROUP BY p.postid
The big downside of this approach is that it has the potential to produce a huge intermediate result, which is then "collapsed" with a "Using filesort" operation (to satisfy the GROUP BY). And this can be pretty expensive for large sets.
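Extended to two joined tables, the DISTINCT pattern could be sketched as follows, again on in-memory SQLite via Python's sqlite3 module instead of MySQL, with an assumed schema where commentid and clickid are the unique columns. The vote score is deliberately left out: SUM(DISTINCT vt.value) would collapse votes that happen to share the same value, not just the duplicated rows, so the sum is better taken with one of the other two patterns.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; schema and data are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post    (postid INTEGER PRIMARY KEY);
    CREATE TABLE comment (commentid INTEGER PRIMARY KEY, postid INTEGER);
    CREATE TABLE click   (clickid INTEGER PRIMARY KEY, postid INTEGER);
    INSERT INTO post (postid) VALUES (1), (2);          -- post 2 has no activity
    INSERT INTO comment (postid) VALUES (1), (1);
    INSERT INTO click (postid) VALUES (1), (1), (1);
""")

# The joins still produce 2 x 3 = 6 rows for post 1, but counting
# distinct key values collapses the duplicates again.
stats = conn.execute("""
    SELECT p.postid
         , COUNT(DISTINCT cm.commentid) AS comment_count
         , COUNT(DISTINCT cl.clickid)   AS click_count
      FROM post p
      LEFT
      JOIN comment cm
        ON cm.postid = p.postid
      LEFT
      JOIN click cl
        ON cl.postid = p.postid
     GROUP BY p.postid
     ORDER BY p.postid
""").fetchall()

print(stats)  # [(1, 2, 3), (2, 0, 0)]
```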
This isn't an exhaustive list of all possible query patterns to achieve the result you are looking to return. Just a representative sampling.