Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Grouping rows with a date range

I am using SQL Server 2008 and need to create a query that shows rows that fall within a date range.

My table is as follows:

ADM_ID    WH_PID     WH_IN_DATETIME    WH_OUT_DATETIME

My rules are:

  • If the WH_OUT_DATETIME is on or within 24 hours of the WH_IN_DATETIME of another ADM_ID with the same WH_P_ID

I would like another column added to the results which identify the grouped value if possible as EP_ID.

e.g.

ADM_ID    WH_PID    WH_IN_DATETIME         WH_OUT_DATETIME
------    ------    --------------         ---------------
1         9         2014-10-12 00:00:00    2014-10-13 15:00:00
2         9         2014-10-14 14:00:00    2014-10-15 15:00:00
3         9         2014-10-16 14:00:00    2014-10-17 15:00:00
4         9         2014-11-20 00:00:00    2014-11-21 00:00:00
5         5         2014-10-17 00:00:00    2014-10-18 00:00:00

Would return rows with:

ADM_ID   WH_PID   EP_ID   EP_IN_DATETIME        EP_OUT_DATETIME       WH_IN_DATETIME        WH_OUT_DATETIME
------   ------   -----   -------------------   -------------------   -------------------   -------------------
1        9        1       2014-10-12 00:00:00   2014-10-17 15:00:00   2014-10-12 00:00:00   2014-10-13 15:00:00
2        9        1       2014-10-12 00:00:00   2014-10-17 15:00:00   2014-10-14 14:00:00   2014-10-15 15:00:00
3        9        1       2014-10-12 00:00:00   2014-10-17 15:00:00   2014-10-16 14:00:00   2014-10-17 15:00:00
4        9        2       2014-11-20 00:00:00   2014-11-20 00:00:00   2014-10-16 14:00:00   2014-11-21 00:00:00
5        5        1       2014-10-17 00:00:00   2014-10-18 00:00:00   2014-10-17 00:00:00   2014-10-18 00:00:00

The EP_OUT_DATETIME will always be the latest date in the group. Hope this clarifies a bit. This way, I can group by the EP_ID and find the EP_OUT_DATETIME and start time for any ADM_ID/PID that fall within.


Each should roll into the next, meaning that if another row has an WH_IN_DATETIME which follows on the WH_OUT_DATETIME of another for the same WH_PID, than that row's WH_OUT_DATETIME becomes the EP_OUT_DATETIME for all of the WH_PID's within that EP_ID.

I hope this makes some sense.

Thanks, MR

like image 468
user4283270 Avatar asked Nov 23 '14 01:11

user4283270


1 Answers

Since the question does not specify that the solution be a "single" query ;-), here is another approach: using the "quirky update" feature dealy, which is updating a variable at the same time you update a column. Breaking down the complexity of this operation, I create a scratch table to hold the piece that is the hardest to calculate: the EP_ID. Once that is done, it gets joined into a simple query and provides the window with which to calculate the EP_IN_DATETIME and EP_OUT_DATETIME fields.

The steps are:

  1. Create the scratch table
  2. Seed the scratch table with all of the ADM_ID values -- this lets us do an UPDATE as all of the rows already exist.
  3. Update the scratch table
  4. Do the final, simple select joining the scratch table to the main table

The Test Setup

SET ANSI_NULLS ON;
SET NOCOUNT ON;

CREATE TABLE #Table
(
   ADM_ID INT NOT NULL PRIMARY KEY,
   WH_PID INT NOT NULL,
   WH_IN_DATETIME DATETIME NOT NULL,
   WH_OUT_DATETIME DATETIME NOT NULL
);

INSERT INTO #Table VALUES (1, 9, '2014-10-12 00:00:00', '2014-10-13 15:00:00');
INSERT INTO #Table VALUES (2, 9, '2014-10-14 14:00:00', '2014-10-15 15:00:00');
INSERT INTO #Table VALUES (3, 9, '2014-10-16 14:00:00', '2014-10-17 15:00:00');
INSERT INTO #Table VALUES (4, 9, '2014-11-20 00:00:00', '2014-11-21 00:00:00');
INSERT INTO #Table VALUES (5, 5, '2014-10-17 00:00:00', '2014-10-18 00:00:00');

Step 1: Create and Populate the Scratch Table

CREATE TABLE #Scratch
(
   ADM_ID INT NOT NULL PRIMARY KEY,
   EP_ID INT NOT NULL
   -- Might need WH_PID and WH_IN_DATETIME fields to guarantee proper UPDATE ordering
);

INSERT INTO #Scratch (ADM_ID, EP_ID)
   SELECT ADM_ID, 0
   FROM   #Table;

Alternate scratch table structure to ensure proper update order (since "quirky update" uses the order of the Clustered Index, as noted at the bottom of this answer):

CREATE TABLE #Scratch
(
   WH_PID INT NOT NULL,
   WH_IN_DATETIME DATETIME NOT NULL,
   ADM_ID INT NOT NULL,
   EP_ID INT NOT NULL
);

INSERT INTO #Scratch (WH_PID, WH_IN_DATETIME, ADM_ID, EP_ID)
   SELECT WH_PID, WH_IN_DATETIME, ADM_ID, 0
   FROM   #Table;

CREATE UNIQUE CLUSTERED INDEX [CIX_Scratch]
   ON #Scratch (WH_PID, WH_IN_DATETIME, ADM_ID);

Step 2: Update the Scratch Table using a local variable to keep track of the prior value

DECLARE @EP_ID INT; -- this is used in the UPDATE

;WITH cte AS
(
  SELECT TOP (100) PERCENT
         t1.*,
         t2.WH_OUT_DATETIME AS [PriorOut],
         t2.ADM_ID AS [PriorID],
         ROW_NUMBER() OVER (PARTITION BY t1.WH_PID ORDER BY t1.WH_IN_DATETIME)
                AS [RowNum]
  FROM   #Table t1
  LEFT JOIN #Table t2
         ON t2.WH_PID = t1.WH_PID
        AND t2.ADM_ID <> t1.ADM_ID
        AND t2.WH_OUT_DATETIME >= (t1.WH_IN_DATETIME - 1)
        AND t2.WH_OUT_DATETIME < t1.WH_IN_DATETIME
  ORDER BY t1.WH_PID, t1.WH_IN_DATETIME
)
UPDATE sc
SET    @EP_ID = sc.EP_ID = CASE
                               WHEN cte.RowNum = 1 THEN 1
                               WHEN cte.[PriorOut] IS NULL THEN (@EP_ID + 1)
                               ELSE @EP_ID
                        END
FROM   #Scratch sc
INNER JOIN cte
        ON cte.ADM_ID = sc.ADM_ID

Step 3: Select Joining the Scratch Table

SELECT tab.ADM_ID,
       tab.WH_PID,
       sc.EP_ID,
       MIN(tab.WH_IN_DATETIME) OVER (PARTITION BY tab.WH_PID, sc.EP_ID)
           AS [EP_IN_DATETIME],
       MAX(tab.WH_OUT_DATETIME) OVER (PARTITION BY tab.WH_PID, sc.EP_ID)
           AS [EP_OUT_DATETIME],
       tab.WH_IN_DATETIME,
       tab.WH_OUT_DATETIME
FROM   #Table tab
INNER JOIN #Scratch sc
    ON sc.ADM_ID = tab.ADM_ID
ORDER BY tab.ADM_ID;

Resources

  • MSDN page for UPDATE

    look for "@variable = column = expression"

  • Performance Analysis of doing Running Totals (not exactly the same thing as here, but not too far off)

    This blog post does mention:

    • PRO: this method is generally pretty fast
    • CON: "The order of the UPDATE is controlled by the order of the clustered index". This behavior might rule out using this method depending on circumstances. But in this particular case, if the WH_PID values are not at least grouped together naturally via the ordering of the clustered index and ordered by WH_IN_DATETIME, then those two fields just get added to the scratch table and the PK (with implied clustered index) on the scratch table becomes (WH_PID, WH_IN_DATETIME, ADM_ID).
like image 89
Solomon Rutzky Avatar answered Sep 21 '22 21:09

Solomon Rutzky