SQL query returns same value in each column

I'm having an issue with SQL joins in a query that joins the Post table to the Comment, Click and Vote tables and is meant to return stats about each post's activity. The query below is what I've been using.

SELECT
    p.PostID, 
    p.Title, 
    CASE 
        WHEN COUNT(cm.CommentID) IS NULL THEN 0
        ELSE COUNT(cm.CommentID)
    END AS CommentCount, 
    CASE
        WHEN COUNT(cl.ClickID) IS NULL THEN 0 
        ELSE COUNT(cl.ClickID)
    END AS ClickCount, 
    CASE
        WHEN SUM(vt.Value) IS NULL THEN 0
        ELSE SUM(vt.Value)
    END AS VoteScore
FROM 
    Post p
    LEFT OUTER JOIN Comment cm ON p.PostID = cm.PostID
    LEFT OUTER JOIN Click cl ON p.PostID = cl.PostID
    LEFT OUTER JOIN Vote vt ON p.PostID = vt.PostID
GROUP BY
    p.PostID, 
    p.Title

It yields the following result:

| PostID | CommentCount | ClickCount | VoteScore |
|--------|--------------|------------|-----------|
| 41     |            60|          60|         60|
| 50     |          1683|        1683|       1683|

This, I know, isn't correct. When I comment out all but one of the joins:

SELECT
     p.PostID 
    ,p.Title 
    ,CASE 
        WHEN COUNT(cm.CommentID) IS NULL THEN 0
        ELSE COUNT(cm.CommentID)
     END AS CommentCount
/*
    ,CASE
        WHEN COUNT(cl.ClickID) IS NULL THEN 0 
        ELSE COUNT(cl.ClickID)
     END AS ClickCount
    ,CASE
        WHEN SUM(vt.Value) IS NULL THEN 0
        ELSE SUM(vt.Value)
     END AS VoteScore
 */
FROM 
    Post p
    LEFT OUTER JOIN Comment cm ON p.PostID = cm.PostID
/*
    LEFT OUTER JOIN Click cl ON p.PostID = cl.PostID
    LEFT OUTER JOIN Vote vt ON p.PostID = vt.PostID
 */
GROUP BY
    p.PostID, 
    p.Title

I get

| PostID | CommentCount |
|--------|--------------|
| 41     |             3|

Which is correct. Any ideas what I've done wrong?

Thanks.

Andy Evans asked Feb 08 '23 09:02


1 Answer

The result that is being returned is expected because the query is producing a Cartesian (or semi-Cartesian) product. The query is basically telling MySQL to perform "cross join" operations on the rows returned from comment, click and vote.

Each row returned from comment (for a given postid) gets matched to each row from click (for the same postid). And then each of the rows in that result gets matched to each row from vote (for the same postid).

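So, for two rows from comment, three rows from click and four rows from vote, the join returns a total of 24 (= 2 x 3 x 4) rows for that post. The aggregates are then computed over those 24 rows, not over the individual tables: COUNT(cm.CommentID) and COUNT(cl.ClickID) both return 24, and SUM(vt.Value) adds each vote 6 (= 2 x 3) times.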
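The multiplication is easy to reproduce on a toy data set. The following is a minimal sketch, not the asker's real schema: the column types, the VoteID column and the sample rows are assumptions made up for illustration, and every vote is given a Value of +1.

```sql
-- Hypothetical minimal schema: only the columns the query touches are modelled.
CREATE TABLE Post    (PostID INT PRIMARY KEY, Title VARCHAR(100));
CREATE TABLE Comment (CommentID INT PRIMARY KEY, PostID INT);
CREATE TABLE Click   (ClickID INT PRIMARY KEY, PostID INT);
CREATE TABLE Vote    (VoteID INT PRIMARY KEY, PostID INT, Value INT);

INSERT INTO Post    VALUES (1, 'demo');
INSERT INTO Comment VALUES (1, 1), (2, 1);                           -- 2 comments
INSERT INTO Click   VALUES (1, 1), (2, 1), (3, 1);                   -- 3 clicks
INSERT INTO Vote    VALUES (1, 1, 1), (2, 1, 1), (3, 1, 1), (4, 1, 1); -- 4 votes of +1

-- Same join shape as the query in the question, plus a raw row count.
SELECT p.PostID
     , COUNT(*)            AS joined_rows
     , COUNT(cm.CommentID) AS CommentCount
     , COUNT(cl.ClickID)   AS ClickCount
     , SUM(vt.Value)       AS VoteScore
  FROM Post p
  LEFT JOIN Comment cm ON cm.PostID = p.PostID
  LEFT JOIN Click   cl ON cl.PostID = p.PostID
  LEFT JOIN Vote    vt ON vt.PostID = p.PostID
 GROUP BY p.PostID;
-- joined_rows = 24, CommentCount = 24, ClickCount = 24, VoteScore = 24
```

With these sample rows all three columns come out as 24, which mirrors the identical values in the question: each aggregate is really measuring the size of the joined result.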

The usual pattern for fixing this is to avoid the cross join operations.

There are a couple of approaches to do that.


correlated subqueries in select list

If you only need a single aggregate (e.g. COUNT or SUM) from each of the three tables, you could remove the joins, and use correlated subqueries in the SELECT list. Write a query that gets a count for a single postid, for example

 SELECT COUNT(1)
   FROM comment cmt
  WHERE cmt.postid = ?

Then wrap that query in parens, reference it in the SELECT list of another query, and replace the question mark with a reference to postid from the table in the outer query.

SELECT p.postid
     , ( SELECT COUNT(1)
           FROM comment cmt
           WHERE cmt.postid = p.postid
       ) AS comment_count
  FROM post p

Repeat the same pattern to get "counts" from click and vote.
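Applied to all three tables, the pattern might look like the sketch below. It reuses the table and column names from the question (Post, Comment, Click, Vote, PostID, Value); the COALESCE around SUM is an addition, there because SUM returns NULL when a post has no votes, whereas COUNT already returns 0.

```sql
SELECT p.PostID
     , p.Title
     , ( SELECT COUNT(*)
           FROM Comment cm
          WHERE cm.PostID = p.PostID
       ) AS CommentCount
     , ( SELECT COUNT(*)
           FROM Click cl
          WHERE cl.PostID = p.PostID
       ) AS ClickCount
     , ( SELECT COALESCE(SUM(vt.Value), 0)  -- SUM over zero rows is NULL
           FROM Vote vt
          WHERE vt.PostID = p.PostID
       ) AS VoteScore
  FROM Post p;
```

No GROUP BY is needed in the outer query, because each subquery already collapses its table to a single value per post.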

The downside of this approach is that the subquery in the SELECT list gets executed for each row returned by the outer query, so it can get expensive if the outer query returns a lot of rows. If comment is a large table, then to get reasonable performance it's critical that an appropriate index is available on comment (one with postid as the leading column).
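As a rough illustration, indexes along these lines would let each correlated subquery be satisfied from the index alone. The index names are made up, and the statements assume no equivalent index exists yet; if the PostID columns are declared as InnoDB foreign keys, an index on them is normally already present.

```sql
-- Hypothetical index names; check SHOW INDEX FROM <table> before adding duplicates.
CREATE INDEX ix_comment_postid    ON Comment (PostID);
CREATE INDEX ix_click_postid      ON Click   (PostID);
-- Including Value lets SUM(Value) be computed without visiting the table rows.
CREATE INDEX ix_vote_postid_value ON Vote    (PostID, Value);
```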


pre-aggregate in inline views

Another approach is to "pre-aggregate" the results in inline views. Write a query that returns the comment count for each postid. For example

SELECT cmt.postid 
     , COUNT(1) 
  FROM comment cmt
 GROUP BY cmt.postid 

Wrap that query in parens, reference it in the FROM clause of another query, and assign it an alias. The inline view query basically takes the place of a table in the outer query.

SELECT p.postid
     , cm.postid
     , cm.comment_count
  FROM post p 
  LEFT
  JOIN ( SELECT cmt.postid 
              , COUNT(1) AS comment_count
           FROM comment cmt
          GROUP BY cmt.postid
       ) cm
    ON cm.postid = p.postid

Repeat that same pattern for click and vote. The trick here is the GROUP BY clause in the inline view query, which guarantees that the view returns at most one row per postid. Joining post to any number of such views therefore can't multiply rows: each post matches at most one row from each view.
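Put together for all three tables, a sketch of the pre-aggregation pattern could look like this. It uses the table and column names from the question; the inline view aliases and the COALESCE calls (which turn the NULLs produced by the outer joins into 0) are additions for illustration.

```sql
SELECT p.PostID
     , p.Title
     , COALESCE(cm.comment_count, 0) AS CommentCount
     , COALESCE(cl.click_count, 0)   AS ClickCount
     , COALESCE(vt.vote_score, 0)    AS VoteScore
  FROM Post p
  LEFT
  JOIN ( SELECT PostID, COUNT(*) AS comment_count
           FROM Comment
          GROUP BY PostID
       ) cm
    ON cm.PostID = p.PostID
  LEFT
  JOIN ( SELECT PostID, COUNT(*) AS click_count
           FROM Click
          GROUP BY PostID
       ) cl
    ON cl.PostID = p.PostID
  LEFT
  JOIN ( SELECT PostID, SUM(Value) AS vote_score
           FROM Vote
          GROUP BY PostID
       ) vt
    ON vt.PostID = p.PostID;
```

The outer query needs no GROUP BY here, since every join is one-to-at-most-one.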

The downside of this approach is that the derived table won't be indexed. So for a large number of postid values, it may be expensive to perform the join in the outer query. (More recent versions of MySQL, 5.6 and later, partially address this downside: the optimizer can automatically create an appropriate index on the derived table.)

We can work around this limitation by creating a temporary table with an appropriate index. This approach requires additional SQL statements, so it is not entirely suitable for an ad hoc single statement. But for batch processing of large sets, the additional complexity can be worth it for some significant performance gains.
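A minimal sketch of the temporary table variant, shown for comment only. The table name tmp_comment_count and the INT type for PostID are assumptions; match the type to the real Post.PostID column.

```sql
-- The primary key doubles as the index used by the join below.
CREATE TEMPORARY TABLE tmp_comment_count
( PostID        INT    NOT NULL PRIMARY KEY
, comment_count BIGINT NOT NULL
);

INSERT INTO tmp_comment_count (PostID, comment_count)
SELECT cmt.PostID
     , COUNT(*)
  FROM Comment cmt
 WHERE cmt.PostID IS NOT NULL   -- keep orphan rows out of the NOT NULL key
 GROUP BY cmt.PostID;

SELECT p.PostID
     , p.Title
     , COALESCE(t.comment_count, 0) AS CommentCount
  FROM Post p
  LEFT
  JOIN tmp_comment_count t
    ON t.PostID = p.PostID;

DROP TEMPORARY TABLE tmp_comment_count;
```

The same three steps (create, populate, join) would be repeated for click and vote.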


collapse Cartesian product by DISTINCT values

As an entirely different approach, leave your query like it is, with the cross join operations, and allow MySQL to produce the Cartesian product. Then the aggregates in the SELECT list can filter out the duplicates. This requires that you have a column (or expression produced) from comment that is UNIQUE for each row in comment for a given postid.

SELECT p.postid
     , COUNT(DISTINCT c.id) AS comment_count
  FROM post p
  LEFT
  JOIN comment c
    ON c.postid = p.postid
 GROUP BY p.postid

The big downside of this approach is that it has the potential to produce a huge intermediate result, which is then "collapsed" with a "Using filesort" operation (to satisfy the GROUP BY). And this can be pretty expensive for large sets.
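For the query in the question, the DISTINCT pattern could be sketched as follows, assuming CommentID and ClickID are unique within their tables. One caveat: DISTINCT only repairs the counts. SUM(DISTINCT vt.Value) would be wrong, because two different votes can carry the same Value, so this sketch pre-aggregates Vote in an inline view and picks the single per-post value with MAX.

```sql
SELECT p.PostID
     , p.Title
     , COUNT(DISTINCT cm.CommentID)    AS CommentCount
     , COUNT(DISTINCT cl.ClickID)      AS ClickCount
     , COALESCE(MAX(vt.vote_score), 0) AS VoteScore  -- same value on every row of the group
  FROM Post p
  LEFT
  JOIN Comment cm
    ON cm.PostID = p.PostID
  LEFT
  JOIN Click cl
    ON cl.PostID = p.PostID
  LEFT
  JOIN ( SELECT PostID, SUM(Value) AS vote_score
           FROM Vote
          GROUP BY PostID
       ) vt
    ON vt.PostID = p.PostID
 GROUP BY p.PostID
        , p.Title;
```

The comment-by-click product is still materialized before it is collapsed, so the cost concern described above applies unchanged.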


This isn't an exhaustive list of all possible query patterns to achieve the result you are looking to return. Just a representative sampling.

spencer7593 answered Feb 09 '23 22:02