Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to accurately use aggregate functions when querying multiple tables?

Tags:

sql

mysql

I am having a harder time than I would expect writing a query that includes aggregate functions, queries multiple tables, and results in accurate numbers and hoping I can get some help.

SQL Fiddle

Category table example

The category is what I need to report on:

|----|-----------|
| id | name      |
|----|-----------|
| 1  | furniture |
| 2  | music     |
| 3  | kitchen   |
| 4  | adventure |
|----|-----------|

Product table example

Product table example:

|-----|----------------|-------------|
| id  | name           | category_id |
|-----|----------------|-------------|
| 101 | couch          | 1           |
| 102 | chair          | 1           |
| 103 | drum           | 2           |
| 104 | flute          | 2           |
| 105 | pot            | 3           |
| 106 | pan            | 3           |
| 107 | kitchen sink   | 3           |
| 108 | unicorn saddle | 4           |
| 109 | unicorn shoes  | 4           |
| 110 | horse shampoo  | 4           |
|-----|----------------|-------------|

Activity table example

The views data we want to sum (by category) found in the activity table:

|----|------------|-------|
| id | product_id | views |
|----|------------|-------|
| 1  | 101        | 1000  |
| 2  | 102        | 2000  |
| 3  | 103        | 3000  |
| 4  | 104        | 4000  |
| 5  | 105        | 5000  |
| 6  | 106        | 6000  |
| 7  | 107        | 7000  |
| 8  | 108        | 8000  |
| 9  | 109        | 9000  |
| 10 | 110        | 10000 |
|----|------------|-------|

Sales table example

The sales table that we want to put query the average sale (again by category). Note that the vendor_id is important as a single product can be carried by multiple vendors. I am leaving out the vendor table as its not needed for this question (we can just query using the vendor ID in later examples).

|----|------------|-----------|--------|
| id | product_id | vendor_id | amount |
|----|------------|-----------|--------|
| 1  | 101        | 1         | 1000   |
| 2  | 102        | 1         | 900    |
| 3  | 103        | 1         | 2000   |
| 4  | 105        | 1         | 3000   |
| 5  | 107        | 1         | 5000   |
| 6  | 101        | 2         | 600    |
| 7  | 103        | 2         | 7000   |
| 8  | 105        | 2         | 8000   |
| 9  | 107        | 2         | 1000   |
| 10 | 108        | 1         | 500    |
| 11 | 109        | 1         | 600    |
| 12 | 108        | 2         | 400    |
| 13 | 109        | 2         | 500    |
|----|------------|-----------|--------|

Desired Output

Below is the desired output:

**Note that some vendors do not carry some products, therefore, meaning that there is no average sales amount..or, in other words, there are no records in the sales table for some products found in the product table (e.g. no vendors carry horse shampoo). For this reason, I want to make sure that any averages or sums that I am using are in fact accurate. To be specific here, if a **.

|-----------|----------------|-----------|---------------|-------------------------------|-------------------------|
| category  | count_products | sum_views | average_sales | sum_views_where_sales_=>_1000 | sum_views_sales_<_1000  |
|-----------|----------------|-----------|---------------|-------------------------------|-------------------------|
| adventure | 3              | 27000     | 500           | 0                             | 27000                   |
| furniture | 2              | 3000      | 833           | 0                             | 3000                    |
| kitchen   | 3              | 18000     | 3000          | 6000                          | 12000                   |
| music     | 2              | 7000      | 5000          | 7000                          | 0                       |
|-----------|----------------|-----------|---------------|-------------------------------|-------------------------|

Current state of query

First to get an accurate count on products and views:

SELECT cat.name AS category,
        count(distinct p.name) AS product,
        sum(a.views) AS views
    FROM
        category AS cat,
        product AS p,
        activity AS a
    WHERE
        cat.id=p.category_id
    AND
        p.id=a.product_id
    GROUP BY 
        category;

sidenote: I would love to not have to use the distinct in the above query. Any ideas here would be great.

Accurate result displaying views by category:

|-----------|---------|-------|
| category  | product | views |
|-----------|---------|-------|
| Adventure | 3       | 27000 |
| Furniture | 2       | 3000  |
| Kitchen   | 3       | 18000 |
| Music     | 2       | 7000  |
|-----------|---------|-------|

Everything looks good until I start joining to the other tables:

SELECT cat.name AS category,
        count(distinct p.name) AS product,
        sum(a.views) AS views,
        round(avg(s.amount)) AS sales_amount
    FROM
        category AS cat,
        product AS p,
        activity AS a,
        sales AS s
    WHERE
        cat.id=p.category_id
    AND
        p.id=a.product_id
    AND
        p.id=s.product_id
    AND 
        s.vendor_id=1
    GROUP BY 
        category;

PROBLEM OUTPUT

|-----------|---------|-------|------------------|
| category  | product | views | avg_sales_amount |
|-----------|---------|-------|------------------|
| Adventure | 2       | 17000 | 550              |
| Furniture | 2       | 3000  | 950              |
| Kitchen   | 2       | 12000 | 4000             |
| Music     | 1       | 3000  | 2000             |
|-----------|---------|-------|------------------|

As you can notice I am getting further from the desired output when I start querying by vendor_id to get the average sales amount. To be specific, the product column no longer results in the correct amount of products because not all vendors carry all the same products making the s.vendor_id=1 filter difficult. I have to use it to be able to filter these reports by vendor while still getting accurate sums on the view field.

I have tried the above queries using a LEFT JOIN but still ends up with innacurrate results and not sure what needs to happen, possibly a sub query of some sort?

like image 321
lcm Avatar asked May 12 '17 09:05

lcm


2 Answers

Your report requirement is surprisingly complex. You may have gone into this project thinking it is far simpler than it is.

This is a case where you are reporting on summaries based on independently tabulated measures (views and sales).

Therefore, you need to start with aggregate subqueries that do not join the two detailed measurement tables together. This is one such query. It gets you views by category. http://sqlfiddle.com/#!9/02f4b6/31/0

                   SELECT c.id category_id, SUM(a.views) views
                     FROM activity a
                     JOIN product p ON a.product_id = p.id
                     JOIN category c ON p.category_id = c.id
                    GROUP BY c.id

This is the other such query. It gets you sales by category. http://sqlfiddle.com/#!9/02f4b6/32/0

                   SELECT c.id category_id, 
                          SUM(s.amount) total_sales, 
                          AVG(s.amount) avg_sales
                     FROM sales s
                     JOIN product p ON s.product_id = p.id
                     JOIN category c ON p.category_id = c.id
                    GROUP BY c.id

Next you need the count of products by category. Fortunately each product can only be in one category. http://sqlfiddle.com/#!9/02f4b6/42/0

                   SELECT c.id category_id, 
                          COUNT(*) products 
                     FROM product p 
                     JOIN category c ON p.category_id = c.id
                    GROUP BY c.id

Now, it's necessary to join these items together. Start with the category table and LEFT JOIN the other three, like this. http://sqlfiddle.com/#!9/02f4b6/51/0

SELECT c.name, aggproducts.products,
       aggviews.views, aggsales.avg_sales, 
       aggsales.total_sales
  FROM category c
  LEFT JOIN (
                   SELECT c.id category_id, SUM(a.views) views
                     FROM activity a
                     JOIN product p ON a.product_id = p.id
                     JOIN category c ON p.category_id = c.id
                    GROUP BY c.id
       ) aggviews ON c.id = aggviews.category_id
  LEFT JOIN (
                   SELECT c.id category_id, 
                          SUM(s.amount) total_sales, 
                          AVG(s.amount) avg_sales
                     FROM sales s
                     JOIN product p ON s.product_id = p.id
                     JOIN category c ON p.category_id = c.id
                    GROUP BY c.id
       ) aggsales ON c.id = aggsales.category_id
  LEFT JOIN (
                   SELECT c.id category_id, 
                          COUNT(*) products 
                     FROM product p 
                     JOIN category c ON p.category_id = c.id
                    GROUP BY c.id
       ) aggproducts ON c.id = aggproducts.category_id

The trick is to create aggregate subqueries for each measure that contain either zero or one rows per category. If any aggregate subqueries contains more than one row per category, you start getting duplication of rows because of JOIN combinatorial explosion.

Then you LEFT JOIN those aggregate subqueries to the category table. Don't use ordinary JOIN because that will suppress rows from the result if any of the aggregate subqueries are missing the particular category.

Notice that you are using these subqueries as if they were tables. This ability to build up queries from subqueries is what puts the structured in structured query language.

Those are the basics. Now you need yet another aggregate subquery for those conditional sums. I'm going to leave that one to you.

like image 156
O. Jones Avatar answered Sep 29 '22 09:09

O. Jones


Please try the following...

SELECT Category.name AS category,
       COUNT( * ) AS count_Product,
       SUM( views ) AS sum_views,
       ROUND( COALESCE( SUM( sumAmount ) / SUM( countAmounts ), 0 ) ) AS average_sales,
       SUM( whereGreater ) AS 'sum_views_where_sales_=>_1000',
       SUM( whereLesser ) AS 'sum_views_sales_<_1000'
FROM Category
JOIN Product ON Category.id = Product.category_id
JOIN Activity ON Product.id = Activity.product_id
LEFT JOIN ( SELECT product_id AS product_id,
                   SUM( amount ) AS sumAmount,
                   COUNT( * ) AS countAmounts
            FROM Sales
            GROUP BY product_id
     ) sumCountAmountFinder ON Product.id = sumCountAmountFinder.product_id
JOIN ( SELECT Activity.product_id AS product_id,
              SUM( CASE WHEN COALESCE( meanAmount, 0 ) >= 1000 THEN views ELSE 0 END ) AS whereGreater,
              SUM( CASE WHEN COALESCE( meanAmount, 0 ) < 1000 THEN views ELSE 0 END ) AS whereLesser
       FROM Activity
       LEFT JOIN ( SELECT product_id AS product_id,
                          SUM( amount ) / COUNT( * ) AS meanAmount
                   FROM Sales
                   GROUP BY product_id
                 ) AS meanAmountFinder ON Activity.product_id = meanAmountFinder.product_id
       GROUP BY Activity.product_id
     ) sumWhereFinder ON Product.id = sumWhereFinder.product_id
GROUP BY Category.name;

Assumptions

That a record in Category will always have at least one record in Product associated with it.

That every record in Product will have a corresponding record in Activity.

Explanation

My statement starts by performing an INNER JOIN between Category and Product, effectively giving us a list of Products associated with each Category.

An INNER JOIN is performed between Activity and the above joined dataset, effectively appending each value of the views field to its corresponding Product record.

The following subquery is then used to determine the total of amount for each product_id IN Sales and the number of records for each product_id. A LEFT JOIN is then performed between Product and and the subquery, effectively appending each record from the subquery to its corresponding Product in the above joined dataset. A LEFT JOIN was used rather than an INNER JOIN as not all records from Product will have corresponding records in Sales and we do not wish for any record from Product to be excluded on the basis of that absence.

SELECT product_id AS product_id,
       SUM( amount ) AS sumAmount,
       COUNT( * ) AS countAmounts
FROM Sales
GROUP BY product_id

Another subquery (as follows) is then used to calculate the mean average of amount for each product_id in Sales.

SELECT product_id AS product_id,
       SUM( amount ) / COUNT( * ) AS meanAmount
FROM Sales
GROUP BY product_id

A LEFT JOIN is then performed between Activity and the results of the subquery. The mean average of amount for each product_id is then compared to 1000 and that product_id's corresponding value of views is placed in the appropriate field, with 0 being placed in the other. If a value of product_id does not have any corresponding records in Sales, then 0 is placed in both fields. The resulting produced by the parent subquery (which follows) is then joined to the above joined dataset.

SELECT Activity.product_id AS product_id,
       SUM( CASE WHEN COALESCE( meanAmount, 0 ) >= 1000 THEN views ELSE 0 END ) AS whereGreater,
       SUM( CASE WHEN COALESCE( meanAmount, 0 ) < 1000 THEN views ELSE 0 END ) AS whereLesser
FROM Activity
LEFT JOIN ( SELECT product_id AS product_id,
                   SUM( amount ) / COUNT( * ) AS meanAmount
            FROM Sales
            GROUP BY product_id
          ) AS meanAmountFinder ON Activity.product_id = meanAmountFinder.product_id
GROUP BY Activity.product_id

With our final joined dataset now formed, the records are then grouped by their value of Category.name. Each group's value of Category.name and its corresponding aggregated values are calculated and returned.

Testing

My statement was tested against a sample database created using the following code...

CREATE TABLE Category
(
    id     INT,
    name   VARCHAR( 50 )
);
INSERT INTO Category ( id,
                       name )
VALUES ( 1, 'furniture' ),
       ( 2, 'music' ),
       ( 3, 'kitchen' ),
       ( 4, 'adventure' );
CREATE TABLE Product
(
    id            INT,
    name          VARCHAR( 50 ),
    category_id   INT
);
INSERT INTO Product ( id,
                      name,
                      category_id )
VALUES ( 101, 'couch',          1 ),
       ( 102, 'chair',          1 ),
       ( 103, 'drum',           2 ),
       ( 104, 'flute',          2 ),
       ( 105, 'pot',            3 ),
       ( 106, 'pan',            3 ),
       ( 107, 'kitchen sink',   3 ),
       ( 108, 'unicorn saddle', 4 ),
       ( 109, 'unicorn shoes',  4 ),
       ( 110, 'horse shampoo',  4 );
CREATE TABLE Activity
(
    id           INT,
    product_id   INT,
    views        INT
);
INSERT INTO Activity ( id,
                       product_id,
                       views )
VALUES ( 1,  101, 1000  ),
       ( 2,  102, 2000  ),
       ( 3,  103, 3000  ),
       ( 4,  104, 4000  ),
       ( 5,  105, 5000  ),
       ( 6,  106, 6000  ),
       ( 7,  107, 7000  ),
       ( 8,  108, 8000  ),
       ( 9,  109, 9000  ),
       ( 10, 110, 10000 );
CREATE TABLE Sales
(
    id           INT,
    product_id   INT,
    vendor_id    INT,
    amount       INT
);
INSERT INTO Sales ( id,
                    product_id,
                    vendor_id,
                    amount )
VALUES ( 1,  101, 1, 1000 ),
       ( 2,  102, 1, 900  ),
       ( 3,  103, 1, 2000 ),
       ( 4,  105, 1, 3000 ),
       ( 5,  107, 1, 5000 ),
       ( 6,  101, 2, 600  ),
       ( 7,  103, 2, 7000 ),
       ( 8,  105, 2, 8000 ),
       ( 9,  107, 2, 1000 ),
       ( 10, 108, 1, 500  ),
       ( 11, 109, 1, 600  ),
       ( 12, 108, 2, 400  ),
       ( 13, 109, 2, 500 );

If you have any questions or comments, then please feel free to post a Comment accordingly.

Further Reading

https://dev.mysql.com/doc/refman/5.7/en/case.html (on MySQL's CASE statement)

https://dev.mysql.com/doc/refman/5.7/en/comparison-operators.html#function_coalesce (on MySQL's COALESCE() function)

https://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_count (on MySQL's 'COUNT()` aggregate function)

https://www.w3schools.com/sql/sql_join.asp (on various types of horizontal JOIN's in SQL)

https://dev.mysql.com/doc/refman/5.7/en/mathematical-functions.html#function_round (on MySQL's ROUND() function)

https://dev.mysql.com/doc/refman/5.7/en/select.html (on MySQL's SELECT statement)

https://dev.mysql.com/doc/refman/5.7/en/group-by-functions.html#function_sum (on MySQL's SUM() aggregate function)

Appendix

If both of my Assumptions are incorrect, then please try the following code...

SELECT Category.name AS category,
       COUNT( * ) AS count_Product,
       COALESCE( SUM( views ), '' ) AS sum_views,
       COALESCE( ROUND( SUM( sumAmount ) / SUM( countAmounts ), '' ) ) AS average_sales,
       COALESCE( SUM( whereGreater ), '' ) AS 'sum_views_where_sales_=>_1000',
       COALESCE( SUM( whereLesser ), '' ) AS 'sum_views_sales_<_1000'
FROM Category
LEFT JOIN Product ON Category.id = Product.category_id
LEFT JOIN Activity ON Product.id = Activity.product_id
LEFT JOIN ( SELECT product_id AS product_id,
                   SUM( amount ) AS sumAmount,
                   COUNT( * ) AS countAmounts
            FROM Sales
            GROUP BY product_id
     ) sumCountAmountFinder ON Product.id = sumCountAmountFinder.product_id
LEFT JOIN ( SELECT Activity.product_id AS product_id,
                   SUM( CASE WHEN COALESCE( meanAmount, 0 ) >= 1000 THEN views ELSE 0 END ) AS whereGreater,
                   SUM( CASE WHEN COALESCE( meanAmount, 0 ) < 1000 THEN views ELSE 0 END ) AS whereLesser
            FROM Activity
            LEFT JOIN ( SELECT product_id AS product_id,
                               SUM( amount ) / COUNT( * ) AS meanAmount
                        FROM Sales
                        GROUP BY product_id
                      ) AS meanAmountFinder ON Activity.product_id = meanAmountFinder.product_id
            GROUP BY Activity.product_id
          ) sumWhereFinder ON Product.id = sumWhereFinder.product_id
GROUP BY Category.name;
like image 20
toonice Avatar answered Sep 29 '22 09:09

toonice