Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Keeping rows from double-counting in a GROUP BY

Here's the basic guts of my schema and problem: http://sqlfiddle.com/#!1/72ec9/4/2

Note that the periods table can refer to a variable range of time - it could be an entire season, it could be a few games or one game. For a given team and year all period rows represent exclusive ranges of time.

I've got a query written which joins up tables and uses a GROUP BY periods.year to aggregate scores for a season (see sqlfiddle). However, if a coach had two positions in the same year the GROUP BY will count the same period row twice. How can I ditch the duplicates when a coach held two positions but still sum up periods when a year is comprised of multiple periods? If there's a better way to do the schema I'd also appreciate it if you pointed it out to me.

like image 933
ldrg Avatar asked Oct 03 '12 19:10

ldrg


People also ask

Does Count work with GROUP BY?

The GROUP BY statement is often used with aggregate functions ( COUNT() , MAX() , MIN() , SUM() , AVG() ) to group the result-set by one or more columns.

How do I count rows in a GROUP BY?

To count the number of rows, use the id column which stores unique values (in our example we use COUNT(id) ). Next, use the GROUP BY clause to group records according to columns (the GROUP BY category above). After using GROUP BY to filter records with aggregate functions like COUNT, use the HAVING clause.

Does GROUP BY reduce rows?

The GROUP BY clause restricts the rows of the result; only one row appears for each distinct value in the grouping column or columns.

Can we use GROUP BY without count?

Using COUNT, without GROUP BY clause will return a total count of a number of rows present in the table. Adding GROUP BY, we can COUNT total occurrences for each unique value present in the column.


2 Answers

The underlying problem (joining to multiple tables with multiple matches) is explained in this related answer:

  • Two SQL LEFT JOINS produce incorrect result

To fix, I first simplified & formatted your query:

select pe.year
     , sum(pe.wins)       AS wins
     , sum(pe.losses)     AS losses
     , sum(pe.ties)       AS ties
     , array_agg(po.id)   AS position_id
     , array_agg(po.name) AS position_names
from   periods_positions_coaches_linking pp
join   positions po ON po.id = pp.position
join   periods   pe ON pe.id = pp.period
where  pp.coach = 1
group  by pe.year
order  by pe.year;

Yields the same, incorrect result as your original, but simpler / faster / easier to read.

  • No point in joining the table coach as long as you don't use columns in the SELECT list. I removed it completely and replaced the WHERE condition with where pp.coach = 1.

  • You don't need COALESCE. NULL values are ignored in the aggregate function sum(). No need to substitute 0.

  • Use table aliases to make it easier to read.

Next, I solved your problem like this:

SELECT *
FROM  (
   SELECT pe.year
        , array_agg(DISTINCT po.id)   AS position_id
        , array_agg(DISTINCT po.name) AS position_names
   FROM   periods_positions_coaches_linking pp
   JOIN   positions                         po ON po.id = pp.position
   JOIN   periods                           pe ON pe.id = pp.period
   WHERE  pp.coach = 1
   GROUP  BY pe.year
   ) po
LEFT   JOIN (
   SELECT pe.year
        , sum(pe.wins)   AS wins
        , sum(pe.losses) AS losses
        , sum(pe.ties)   AS ties
   FROM  (
      SELECT period
      FROM   periods_positions_coaches_linking
      WHERE  coach = 1
      GROUP  BY period
      ) pp
   JOIN   periods pe ON pe.id = pp.period
   GROUP  BY pe.year
   ) pe USING (year)
ORDER  BY year;
  • Aggregate positions and periods separately before joining them.

  • In the first sub-query po list positions only once with array_agg(DISTINCT ...).

  • In the second sub-query pe ...

    • GROUP BY period, because a coach can have multiple positions per period.
    • JOIN to periods-data after that, and then aggregate to get sums.

db<>fiddle here
Old sqlfiddle

like image 83
Erwin Brandstetter Avatar answered Sep 19 '22 12:09

Erwin Brandstetter


use distinct as shown here

code:

select periods.year as year,
sum(coalesce(periods.wins, 0)) as wins,
sum(coalesce(periods.losses, 0)) as losses,
sum(coalesce(periods.ties, 0)) as ties,
array_agg( distinct positions.id) as position_id,
array_agg( distinct positions.name) as position_names

from periods_positions_coaches_linking

join coaches on coaches.id = periods_positions_coaches_linking.coach
join positions on positions.id = periods_positions_coaches_linking.position
join periods on periods.id = periods_positions_coaches_linking.period

where coaches.id = 1

group by periods.year, positions.id
order by periods.year;
like image 26
Teena Thomas Avatar answered Sep 21 '22 12:09

Teena Thomas