Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping to find min,max for each group

This would be relatively easy if I only cared about a single min and max for each group, the problem is my requirement is to find the various boundaries. An example data set is as follows:

BoundaryColumn  GroupIdentifier
1                  A
3                  A
4                  A
7                  A
8                  B
9                  B  
11                 B  
13                 A
14                 A
15                 A
16                 A

What I need from the sql is a result set as follows:

min  max  groupid
1    7    A
8    11   B
13   16   A

Essentially finding the boundaries for each cluster of the groups.

The data would be stored in either oracle11g or mysql so syntax can be provided for either platform.

like image 278
user2887330 Avatar asked Jan 31 '26 17:01

user2887330


2 Answers

A disclaimer: It's a lot easier to query partial results and process something like this with a front-end language. That said...

The following query works for Oracle (which supports analytic queries) but not for MySQL (which does not). There's a SQL Fiddle here.

WITH BoundX AS (
  SELECT * FROM (
    SELECT
     BoundaryColumn,
     GroupIdentifier,
     LAG(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLag,
     LEAD(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLead
    FROM MyTable
    ORDER BY BoundaryColumn
  )
  WHERE GIDLag IS NULL OR GroupIdentifier <> GIDLag
     OR GIDLead IS NULL OR GroupIdentifier <> GIDLead
)
SELECT MIN, MAX, GROUPID
FROM (
  SELECT
    BoundaryColumn AS MIN,
    LEAD(BoundaryColumn) OVER (ORDER BY BoundaryColumn) AS MAX,
    GroupIdentifier AS GROUPID,
    GIDLag,
    GIDLead
  FROM BoundX
)
WHERE GROUPID = GIDLead

Here's the logic, step by step. You may be able to improve on this, because I get the feeling there's one subquery too many here...

This query pulls the prior and following GroupIdentifier values into each row:

SELECT
 BoundaryColumn,
 GroupIdentifier,
 LAG(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLag,
 LEAD(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLead
FROM MyTable
ORDER BY BoundaryColumn

The result looks like this:

BoundaryColumn  GroupIdentifier  GIDLag  GIDLead
1                  A                         A
3                  A                A        A
4                  A                A        A
7                  A                A        B
8                  B                A        B
9                  B                B        B
11                 B                B        A
13                 A                B        A
14                 A                A        A
15                 A                A        A
16                 A                A

If you add logic to get rid of all the rows where GIDLag = GIDLead = GroupIdentifier, you'll end up with the boundaries:

WITH BoundX AS (
  SELECT * FROM (
    SELECT
     BoundaryColumn,
     GroupIdentifier,
     LAG(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLag,
     LEAD(GroupIdentifier) OVER (ORDER BY BoundaryColumn) AS GIDLead
    FROM MyTable
    ORDER BY BoundaryColumn
  )
  WHERE GIDLag IS NULL OR GroupIdentifier <> GIDLag
     OR GIDLead IS NULL OR GroupIdentifier <> GIDLead
)
SELECT
  BoundaryColumn AS MIN,
  LEAD(BoundaryColumn) OVER (ORDER BY BoundaryColumn) AS MAX,
  GroupIdentifier AS GROUPID,
  GIDLag,
  GIDLead
FROM BoundX

With this addition the results are:

MIN MAX GROUPID GIDLAG GIDLEAD
--- --- ------- ------ -------
  1   7 A              A
  7   8 A       A      B
  8  11 B       A      B
 11  13 B       B      A
 13  16 A       B      A
 16     A       A

Finally, include only those rows where GroupID = GIDLead. That's the query at the top of this answer. The results are:

MIN MAX GROUPID
--- --- -------
  1   7 A
  8  11 B
 13  16 A
like image 178
Ed Gibbs Avatar answered Feb 02 '26 07:02

Ed Gibbs


Take a look at this site regarding "runs" of data: http://www.sqlteam.com/article/detecting-runs-or-streaks-in-your-data

Armed with the knowledge provided in that link, you could write a query like this:

SELECT BoundaryColumn,
GroupIdentifier,
(
SELECT COUNT(*)
FROM Table T
WHERE T.GroupIdentifier <> TR.GroupIdentifier
AND T.BoundaryColumn <= TR.BoundaryColumn
) as RunGroup
FROM Table TR

Using this information, you could then group by "RunGroup", and select the GroupIdentifier and min/max BoundaryColumn.

EDIT: I've felt the peer pressure, here's an SQLFiddle with my version of the answer: http://www.sqlfiddle.com/#!8/9a24c/4/0

like image 26
Klazen108 Avatar answered Feb 02 '26 09:02

Klazen108