
Using SQL Aggregate Functions With Multiple Joins

I am attempting to use multiple aggregate functions across multiple tables in a single SQL query (using Postgres).

My tables are structured similarly to the following:

-- "user" is a reserved word in Postgres, so the table name must be double-quoted.
CREATE TABLE "user" (user_id INT PRIMARY KEY, user_date_created TIMESTAMP NOT NULL);
CREATE TABLE item_sold (item_sold_id INT PRIMARY KEY, sold_user_id INT NOT NULL);
CREATE TABLE item_bought (item_bought_id INT PRIMARY KEY, bought_user_id INT NOT NULL);

I want to count the number of items bought and sold for each user. The solution I thought up does not work:

SELECT user_id, COUNT(item_sold_id), COUNT(item_bought_id)
FROM "user"
LEFT JOIN item_sold ON sold_user_id=user_id
LEFT JOIN item_bought ON bought_user_id=user_id
WHERE user_date_created > '2014-01-01'
GROUP BY user_id;

That seems to produce every combination of (item_sold_id, item_bought_id), e.g. if a user has 4 sold and 2 bought, both COUNT()s are 8.

How can I properly query the table to obtain both counts?

Tanaki asked Jul 16 '14 19:07


3 Answers

The easy fix to your query is to use distinct:

SELECT user_id, COUNT(distinct item_sold_id), COUNT(distinct item_bought_id)
FROM "user"
LEFT JOIN item_sold ON sold_user_id=user_id
LEFT JOIN item_bought ON bought_user_id=user_id
WHERE user_date_created > '2014-01-01'
GROUP BY user_id;

However, the query is doing unnecessary work. If someone has 100 items bought and 200 items sold, the join produces 20,000 intermediate rows for that one user, which COUNT(DISTINCT ...) then has to de-duplicate.
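The fan-out is easy to reproduce with the question's own numbers. The following is a minimal sketch for a scratch database, assuming the schema from the question with the table name written as "user" (quoted because user is a reserved word in Postgres); the sample rows are invented for illustration:

```sql
-- Schema from the question ("user" quoted; an assumption for Postgres).
CREATE TABLE "user" (user_id INT PRIMARY KEY, user_date_created TIMESTAMP NOT NULL);
CREATE TABLE item_sold (item_sold_id INT PRIMARY KEY, sold_user_id INT NOT NULL);
CREATE TABLE item_bought (item_bought_id INT PRIMARY KEY, bought_user_id INT NOT NULL);

-- Hypothetical sample data: one user with 4 items sold and 2 items bought.
INSERT INTO "user" (user_id, user_date_created) VALUES (1, '2014-06-01');
INSERT INTO item_sold (item_sold_id, sold_user_id)
VALUES (1, 1), (2, 1), (3, 1), (4, 1);
INSERT INTO item_bought (item_bought_id, bought_user_id)
VALUES (1, 1), (2, 1);

-- Without GROUP BY, the two joins pair every sold row with every bought row:
-- 4 x 2 = 8 rows for user 1.
SELECT u.user_id, s.item_sold_id, b.item_bought_id
FROM "user" u
LEFT JOIN item_sold s ON s.sold_user_id = u.user_id
LEFT JOIN item_bought b ON b.bought_user_id = u.user_id
ORDER BY s.item_sold_id, b.item_bought_id;
```

Each item_sold_id appears twice and each item_bought_id four times in that result, which is why both plain COUNT()s return 8 while COUNT(DISTINCT ...) returns 4 and 2.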

The solution is to pre-aggregate the results or use a correlated subquery in the select. In this case, I prefer the correlated subquery solution (assuming the right indexes are available):

SELECT u.user_id,
       (select count(*) from item_sold s where u.user_id = s.sold_user_id) AS total_item_sold,
       (select count(*) from item_bought b where u.user_id = b.bought_user_id) AS total_item_bought
FROM "user" u
WHERE u.user_date_created > '2014-01-01';

The right indexes are item_sold(sold_user_id) and item_bought(bought_user_id). I prefer this over pre-aggregation because of the filtering on the user table: the subqueries run only for users created after 2014-01-01, which is harder to arrange with pre-aggregation.
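For reference, the two indexes named above and the pre-aggregated alternative could be written roughly as follows. This is a sketch rather than a tuned recommendation: the index names and column aliases are hypothetical, and the table name is written as "user" on the assumption that it has to be quoted in Postgres.

```sql
-- Indexes supporting the per-user lookups (index names are hypothetical).
CREATE INDEX item_sold_sold_user_id_idx ON item_sold (sold_user_id);
CREATE INDEX item_bought_bought_user_id_idx ON item_bought (bought_user_id);

-- Pre-aggregation: count each table once per user, then join the two
-- one-row-per-user results, so the joins can no longer multiply rows.
SELECT u.user_id,
       COALESCE(s.total_sold, 0)   AS total_item_sold,   -- 0 when nothing was sold
       COALESCE(b.total_bought, 0) AS total_item_bought  -- 0 when nothing was bought
FROM "user" u
LEFT JOIN (
    SELECT sold_user_id, COUNT(*) AS total_sold
    FROM item_sold
    GROUP BY sold_user_id
) s ON s.sold_user_id = u.user_id
LEFT JOIN (
    SELECT bought_user_id, COUNT(*) AS total_bought
    FROM item_bought
    GROUP BY bought_user_id
) b ON b.bought_user_id = u.user_id
WHERE u.user_date_created > '2014-01-01';
```

As written, each derived table aggregates over all users, not just those passing the date filter, which is the extra work the correlated-subquery version avoids.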

Gordon Linoff answered Nov 14 '22 02:11


With a lateral join (Postgres 9.3 or later) it is possible to pre-aggregate only the filtered users:

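-- coalesce turns the NULL produced by the left join into 0 for users with no rows
select user_id,
       coalesce(total_item_sold, 0) as total_item_sold,
       coalesce(total_item_bought, 0) as total_item_bought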
from
    "user" u
    left join lateral (
        select sold_user_id, count(*) as total_item_sold
        from item_sold
        where sold_user_id = u.user_id
        group by sold_user_id
    ) item_sold on user_id = sold_user_id
    left join lateral (
        select bought_user_id, count(*) as total_item_bought
        from item_bought
        where bought_user_id = u.user_id
        group by bought_user_id
    ) item_bought on user_id = bought_user_id
where u.user_date_created >= '2014-01-01'

Note that the filter needs >=; otherwise a row stamped at exactly 2014-01-01 00:00:00, the first moment of the year, is missed. Although that timestamp is unlikely with manually entered data, it is common with automated jobs.
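The boundary case can be checked directly. A small sketch, with column aliases chosen only for illustration:

```sql
-- The bare date literal is coerced to the timestamp 2014-01-01 00:00:00,
-- so a row created at exactly that instant passes >= but not >.
SELECT TIMESTAMP '2014-01-01 00:00:00' >  '2014-01-01' AS kept_by_gt,   -- false
       TIMESTAMP '2014-01-01 00:00:00' >= '2014-01-01' AS kept_by_gte;  -- true
```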

Clodoaldo Neto answered Nov 14 '22 00:11


Another way to solve this problem is to use two nested (correlated) selects:

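-- Columns are qualified so the subqueries cannot silently bind to the wrong table.
select u.user_id,
       (select count(*) from item_sold s where s.sold_user_id = u.user_id) as total_item_sold,
       (select count(*) from item_bought b where b.bought_user_id = u.user_id) as total_item_bought
from "user" u
where u.user_date_created > '2014-01-01';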

Piotr Miś answered Nov 14 '22 01:11