Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group data by the change of grouping column value in order

With the following data

create table #ph (product int, [date] date, price int)
insert into #ph select 1, '20120101', 1
insert into #ph select 1, '20120102', 1
insert into #ph select 1, '20120103', 1
insert into #ph select 1, '20120104', 1
insert into #ph select 1, '20120105', 2
insert into #ph select 1, '20120106', 2
insert into #ph select 1, '20120107', 2
insert into #ph select 1, '20120108', 2
insert into #ph select 1, '20120109', 1
insert into #ph select 1, '20120110', 1
insert into #ph select 1, '20120111', 1
insert into #ph select 1, '20120112', 1

I would like to produce the following output:

product | date_from | date_to  | price
  1     | 20120101  | 20120105 |   1
  1     | 20120105  | 20120109 |   2
  1     | 20120109  | 20120112 |   1

If I group by price and show the max and min date then I will get the following which is not what I want (see the over lapping of dates).

product | date_from | date_to  | price
  1     | 20120101  | 20120112 |   1
  1     | 20120105  | 20120108 |   2

So essentially what I'm looking to do is group by the step change in data based on group columns product and price.

What is the cleanest way to achieve this?

like image 442
MrEdmundo Avatar asked Apr 11 '12 16:04

MrEdmundo


People also ask

Does GROUP BY change order?

group by does not order the data neccessarily. A DB is designed to grab the data as fast as possible and only sort if necessary. So add the order by if you need a guaranteed order.

Does column order matter in GROUP BY?

The correct answer is "it depends". columns ordering in GROUP BY. an ORDER BY, so in this case we may say that the order of the columns in the GROUP BY clause did matter, but only for the data set ordering, NOT for its grouping.

Does GROUP BY comes first or order by?

Using Group By and Order By Together When combining the Group By and Order By clauses, it is important to bear in mind that, in terms of placement within a SELECT statement: The GROUP BY clause is placed after the WHERE clause. The GROUP BY clause is placed before the ORDER BY clause.


2 Answers

There's a (more or less) known technique of solving this kind of problem, involving two ROW_NUMBER() calls, like this:

WITH marked AS (
  SELECT
    *,
    grp = ROW_NUMBER() OVER (PARTITION BY product        ORDER BY date)
        - ROW_NUMBER() OVER (PARTITION BY product, price ORDER BY date)
  FROM #ph
)
SELECT
  product,
  date_from = MIN(date),
  date_to   = MAX(date),
  price
FROM marked
GROUP BY
  product,
  price,
  grp
ORDER BY
  product,
  MIN(date)

Output:

product  date_from   date_to        price 
-------  ----------  -------------  ----- 
1        2012-01-01  2012-01-04     1     
1        2012-01-05  2012-01-08     2     
1        2012-01-09  2012-01-12     1     
like image 73
Andriy M Avatar answered Sep 20 '22 07:09

Andriy M


I'm new to this forum so hope my contribution is helpful.

If you really don't want to use a CTE (although I think thats probably the best approach) you can get a solution using set based code. You will need to test the performance of this code!.

I have added in an extra temp table so that I can use a unique identifier for each record but I suspect you will already have this column in you source table. So heres the temp table.

    If Exists (SELECT Name FROM tempdb.sys.tables WHERE name LIKE '#phwithId%')
        DROP TABLE #phwithId    

    CREATE TABLE #phwithId
    (
        SaleId INT
        , ProductID INT
        , Price Money
        , SaleDate Date 
    )
    INSERT INTO #phwithId SELECT row_number() over(partition by product order by [date] asc) as SalesId, Product, Price, Date FROM ph 

Now the main body of the Select statement

    SELECT 
        productId 
        , date_from
        , date_to
        , Price
    FROM
        (   
            SELECT 
                dfr.ProductId
                , ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY ChangeDate) AS rowno1          
                , ChangeDate AS date_from
                , dfr.Price
            FROM
                (       
                    SELECT
                        sl1.ProductId AS ProductId
                        , sl1.SaleDate AS ChangeDate
                        , sl1.price
                    FROM
                        #phwithId sl1
                    LEFT JOIN
                        #phwithId sl2
                        ON sl1.SaleId = sl2.SaleId + 1
                    WHERE
                        sl1.Price <> sl2.Price OR sl2.Price IS NULL
                ) dfr
        ) da1
    LEFT JOIN
        (   
            SELECT 
                ROW_NUMBER() OVER (PARTITION BY ProductId ORDER BY ChangeDate) AS rowno2
                , ChangeDate AS date_to     
            FROM
                (   
                    SELECT 
                        sl1.ProductId
                        , sl1.SaleDate AS ChangeDate
                    FROM
                        #phwithId sl1
                    LEFT JOIN
                        #phwithId sl3
                        ON sl1.SaleId = sl3.SaleId - 1  
                    WHERE
                        sl1.Price <> sl3.Price OR sl3.Price IS NULL         
                ) dto

        ) da2 
        ON da1.rowno1 = da2.rowno2  

By binding the data source offset by 1 record (+or-) we can identify when the price buckets change and then its just a matter of getting the start and end dates for the buckets back into a single record.

All a bit fiddly and I'm not sure its going to give better performance but I enjoyed the challenge.

like image 38
redsevi Avatar answered Sep 24 '22 07:09

redsevi