
Is there an established pattern for SQL queries which group by a range?

I've seen a lot of questions on SO concerning how to group data by a range in a SQL query.

The exact scenarios vary, but the general underlying problem in each is to group by a range of values rather than each discrete value in the GROUP BY column. In other words, to group by a less precise granularity than you're storing in the database table.

This crops up often in the real world when producing things like histograms, calendar representations, pivot tables and other bespoke reporting outputs.

Some example data (tables unrelated):

|      OrderHistory       |       |         Staff        |                
---------------------------       ------------------------
|    Date    |  Quantity  |       |   Age     |   Name   |
---------------------------       ------------------------       
|01-Jul-2012 |     2      |       |    19     |   Barry  |
|02-Jul-2012 |     5      |       |    53     |   Nigel  |
|08-Jul-2012 |     1      |       |    29     |   Donna  |
|10-Jul-2012 |     3      |       |    26     |   James  |
|14-Jul-2012 |     4      |       |    44     |   Helen  |
|17-Jul-2012 |     2      |       |    49     |   Wendy  |
|28-Jul-2012 |     6      |       |    62     |   Terry  |
---------------------------       ------------------------

Now let's say we want to use the Date column of the OrderHistory table to group by weeks, i.e. 7-day ranges. Or perhaps group the Staff into 10-year age ranges:

|       Week      |  QtyCount  |        |  AgeGroup | NameCount |         
--------------------------------        -------------------------
|01-Jul to 07-Jul |     7      |        |   10-19   |    1      |
|08-Jul to 14-Jul |     8      |        |   20-29   |    2      | 
|15-Jul to 21-Jul |     2      |        |   30-39   |    0      |
|22-Jul to 28-Jul |     6      |        |   40-49   |    2      |
--------------------------------        |   50-59   |    1      |
                                        |   60-69   |    1      |
                                        -------------------------

GROUP BY Date and GROUP BY Age on their own won't do it.

The most common answers I see (none of which are consistently voted "correct") are to use one or more of:

  • a bunch of CASE statements, one per grouping
  • a bunch of UNION queries, with a different WHERE clause per grouping
  • as I'm working with SQL Server, PIVOT() and UNPIVOT()
  • a two-stage query using a sub-select, temp table or View construct

Is there an established generic pattern for dealing with such queries?

Widor asked Jul 17 '12 16:07


2 Answers

You can use dimensional modeling techniques such as fact tables and dimension tables. OrderHistory can act as a fact table, with a DateKey foreign key relating it to a Date dimension. The Date dimension can have a schema such as the one below:

[image: Date Dimension]

Note that the Date table is pre-filled with data covering up to N years.

Using the example above, here is a sample query to get the result:

select CalendarWeek, sum(Quantity)
from OrderHistory a
join DimDate b
    on a.DateKey = b.DateKey
group by CalendarWeek
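To make the join above concrete, here is a minimal runnable sketch. It uses Python's built-in sqlite3 module in place of SQL Server so that it is self-contained. The DimDate layout (DateKey, FullDate, CalendarWeek) and the rule that a week is a 7-day block counted from 01-Jul-2012 are assumptions for illustration only; a real date dimension would usually carry many more attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical minimal date dimension: one row per calendar day.
    CREATE TABLE DimDate (
        DateKey      INTEGER PRIMARY KEY,  -- e.g. 20120701
        FullDate     TEXT NOT NULL,
        CalendarWeek INTEGER NOT NULL      -- assumed: 7-day blocks from 01-Jul-2012
    );
    CREATE TABLE OrderHistory (
        DateKey  INTEGER NOT NULL REFERENCES DimDate(DateKey),
        Quantity INTEGER NOT NULL
    );
""")

# Pre-fill the dimension for 01-Jul-2012 .. 28-Jul-2012.
for day in range(1, 29):
    conn.execute(
        "INSERT INTO DimDate VALUES (?, ?, ?)",
        (20120700 + day, f"2012-07-{day:02d}", (day - 1) // 7 + 1),
    )

# The OrderHistory rows from the question, keyed by DateKey.
orders = [(20120701, 2), (20120702, 5), (20120708, 1), (20120710, 3),
          (20120714, 4), (20120717, 2), (20120728, 6)]
conn.executemany("INSERT INTO OrderHistory VALUES (?, ?)", orders)

# Same shape as the query above: the grouping column lives in the dimension.
rows = conn.execute("""
    SELECT b.CalendarWeek, SUM(a.Quantity) AS QtyCount
    FROM OrderHistory a
    JOIN DimDate b ON a.DateKey = b.DateKey
    GROUP BY b.CalendarWeek
    ORDER BY b.CalendarWeek
""").fetchall()

for week, qty in rows:
    print(week, qty)
```

The range logic is stored once in the dimension table, so the reporting query itself stays a plain GROUP BY.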

For the Staff table, you can store a birthday key instead of the age and let the query calculate the age and the ranges.
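As a rough sketch of calculating the age and range in the query, the following uses Python's sqlite3 module rather than SQL Server. The BirthDate column and every birth date in it are hypothetical, chosen only so that the ages match the question's Staff table as of 17-Jul-2012. The date arithmetic is SQLite-specific and would need adapting to T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (Name TEXT, BirthDate TEXT)")

# Hypothetical birth dates giving the ages 19, 53, 29, 26, 44, 49, 62
# when evaluated on 2012-07-17.
conn.executemany("INSERT INTO Staff VALUES (?, ?)", [
    ("Barry", "1993-03-01"), ("Nigel", "1958-11-20"), ("Donna", "1983-01-15"),
    ("James", "1985-09-30"), ("Helen", "1968-05-05"), ("Wendy", "1963-02-14"),
    ("Terry", "1950-06-01"),
])

AS_OF = "2012-07-17"  # fixed reference date so the result is repeatable

rows = conn.execute("""
    WITH ages AS (
        SELECT Name,
               -- year difference, minus 1 if the birthday has not occurred yet
               CAST(strftime('%Y', :as_of) AS INTEGER)
             - CAST(strftime('%Y', BirthDate) AS INTEGER)
             - (strftime('%m-%d', :as_of) < strftime('%m-%d', BirthDate)) AS Age
        FROM Staff
    )
    SELECT (Age / 10) * 10 AS RangeStart,   -- integer division: 26 -> 20
           COUNT(*)        AS NameCount
    FROM ages
    GROUP BY (Age / 10) * 10
    ORDER BY RangeStart
""", {"as_of": AS_OF}).fetchall()

for start, count in rows:
    print(f"{start}-{start + 9}", count)
```

Grouping on a computed expression like this returns only ranges that contain at least one row, so the empty 30-39 range from the question's desired output does not appear.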

Here is SQL Fiddle

Date dimension population script was taken from here.

Void Ray answered Sep 23 '22 22:09


As is often the case, this SQL problem requires using more than one pattern in composition.

In this case the two you can use are

  • NTILE
  • Numbers Table

You can use NTILE to create a set number of groups. However, since not every value in each group is represented in your data (no one is aged 30-39, for example), you also need a numbers table to supply the missing values. Since you're using SQL Server you have it easy, as you don't have to simulate either.

Here's an example for the Staff problem

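-- g: the numbers 10..69 split into 6 equal groups of 10 consecutive ages
WITH g AS (
    SELECT
        NTILE(6) OVER (ORDER BY number) AS grp,
        number
    FROM
        master..spt_values
    WHERE
        type = 'P'
        AND number >= 10 AND number <= 69
)
SELECT
    CAST(MIN(g.number) AS varchar(3)) + ' - ' +
    CAST(MAX(g.number) AS varchar(3)) AS AgeGroup,
    COUNT(s.Age) AS NameCount   -- counts matched Staff rows only, so empty groups show 0
FROM
    g
    LEFT JOIN Staff s
        ON g.number = s.Age
GROUP BY
    grp
ORDER BY
    grp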

DEMO
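For anyone without SQL Server to hand, the same composition can be tried with Python's sqlite3 module. This is only a sketch under two assumptions: a recursive CTE stands in for master..spt_values as the numbers table, and the bundled SQLite is 3.25 or newer so that NTILE is available. The Staff rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Staff (Age INTEGER, Name TEXT)")
conn.executemany("INSERT INTO Staff VALUES (?, ?)", [
    (19, "Barry"), (53, "Nigel"), (29, "Donna"), (26, "James"),
    (44, "Helen"), (49, "Wendy"), (62, "Terry"),
])

rows = conn.execute("""
    WITH RECURSIVE numbers(number) AS (
        -- simulated numbers table: every age from 10 to 69
        SELECT 10
        UNION ALL
        SELECT number + 1 FROM numbers WHERE number < 69
    ),
    g AS (
        -- 60 numbers split into 6 tiles of 10 consecutive ages each
        SELECT NTILE(6) OVER (ORDER BY number) AS grp, number
        FROM numbers
    )
    SELECT MIN(g.number) || '-' || MAX(g.number) AS AgeGroup,
           COUNT(s.Age) AS NameCount   -- NULLs from the LEFT JOIN are not counted
    FROM g
    LEFT JOIN Staff s ON g.number = s.Age
    GROUP BY g.grp
    ORDER BY g.grp
""").fetchall()

for age_group, name_count in rows:
    print(age_group, name_count)
```

Because the groups come from the numbers table rather than from Staff, the 30-39 range is reported with a count of 0 instead of being dropped.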

You can apply this to dates as well; it just requires some date-to-day manipulation.
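One possible shape for the date version, again sketched with Python's sqlite3 module rather than SQL Server: generate a day offset per calendar day, turn it into a week number with integer division by 7, and LEFT JOIN the orders onto that calendar. The OrderDate column name, the ISO date strings and the fixed 01-Jul-2012 start are assumptions for illustration, and integer division replaces NTILE here because the bucket width is known in advance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# OrderDate stands in for the question's Date column, stored as ISO text.
conn.execute("CREATE TABLE OrderHistory (OrderDate TEXT, Quantity INTEGER)")
conn.executemany("INSERT INTO OrderHistory VALUES (?, ?)", [
    ("2012-07-01", 2), ("2012-07-02", 5), ("2012-07-08", 1), ("2012-07-10", 3),
    ("2012-07-14", 4), ("2012-07-17", 2), ("2012-07-28", 6),
])

rows = conn.execute("""
    WITH RECURSIVE days(n) AS (
        -- day offsets 0..27, i.e. four full weeks
        SELECT 0
        UNION ALL
        SELECT n + 1 FROM days WHERE n < 27
    ),
    calendar AS (
        SELECT n / 7 AS week_no,                            -- 0..3
               date('2012-07-01', '+' || n || ' days') AS d
        FROM days
    )
    SELECT MIN(c.d) || ' to ' || MAX(c.d)  AS Week,
           COALESCE(SUM(o.Quantity), 0)    AS QtyCount      -- 0 for empty weeks
    FROM calendar c
    LEFT JOIN OrderHistory o ON o.OrderDate = c.d
    GROUP BY c.week_no
    ORDER BY c.week_no
""").fetchall()

for week, qty in rows:
    print(week, qty)
```

In T-SQL the same day offset could come from DATEDIFF(day, start, Date) joined against a numbers table.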

Conrad Frix answered Sep 23 '22 22:09