
Using GROUP BY with FIRST_VALUE and LAST_VALUE

I'm working with some data that is currently stored in 1 minute intervals that looks like this:

    CREATE TABLE #MinuteData
        (
          [Id] INT ,
          [MinuteBar] DATETIME ,
          [Open] NUMERIC(12, 6) ,
          [High] NUMERIC(12, 6) ,
          [Low] NUMERIC(12, 6) ,
          [Close] NUMERIC(12, 6)
        );

    INSERT  INTO #MinuteData
            ( [Id], [MinuteBar], [Open], [High], [Low], [Close] )
    VALUES  ( 1, '2015-01-01 17:00:00', 1.557870, 1.557880, 1.557870, 1.557880 ),
            ( 2, '2015-01-01 17:01:00', 1.557900, 1.557900, 1.557880, 1.557880 ),
            ( 3, '2015-01-01 17:02:00', 1.557960, 1.558070, 1.557960, 1.558040 ),
            ( 4, '2015-01-01 17:03:00', 1.558080, 1.558100, 1.558040, 1.558050 ),
            ( 5, '2015-01-01 17:04:00', 1.558050, 1.558100, 1.558020, 1.558030 ),
            ( 6, '2015-01-01 17:05:00', 1.558580, 1.558710, 1.557870, 1.557950 ),
            ( 7, '2015-01-01 17:06:00', 1.557910, 1.558120, 1.557910, 1.557990 ),
            ( 8, '2015-01-01 17:07:00', 1.557940, 1.558250, 1.557940, 1.558170 ),
            ( 9, '2015-01-01 17:08:00', 1.558140, 1.558200, 1.558080, 1.558120 ),
            ( 10, '2015-01-01 17:09:00', 1.558110, 1.558140, 1.557970, 1.557970 );

    SELECT  *
    FROM    #MinuteData;

    DROP TABLE #MinuteData;

The values track currency exchange rates, so for each minute interval (bar), there is the Open price as the minute started and a Close price for the minute end. The High and Low values represent the highest and lowest rate during each individual minute.

Desired Output

I'm looking to reformat this data into 5-minute intervals to produce the following output:

    MinuteBar                Open       Close       Low         High
    2015-01-01 17:00:00.000  1.557870   1.558030    1.557870    1.558100
    2015-01-01 17:05:00.000  1.558580   1.557970    1.557870    1.558710

This takes the Open value from the first minute of each 5-minute period and the Close value from the last minute of that period. The High and Low values represent the highest high and lowest low rates across the 5-minute period.
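The aggregation rule can be sketched outside the database to make the expected result concrete. This is a minimal Python illustration, not part of the original question: the names `bucket` and `resample` and the midnight `ANCHOR` are assumptions chosen for the example, and the rows are the sample data from above.

```python
from datetime import datetime
from itertools import groupby

# (MinuteBar, Open, High, Low, Close) rows copied from the sample data
minute_data = [
    ("2015-01-01 17:00:00", 1.557870, 1.557880, 1.557870, 1.557880),
    ("2015-01-01 17:01:00", 1.557900, 1.557900, 1.557880, 1.557880),
    ("2015-01-01 17:02:00", 1.557960, 1.558070, 1.557960, 1.558040),
    ("2015-01-01 17:03:00", 1.558080, 1.558100, 1.558040, 1.558050),
    ("2015-01-01 17:04:00", 1.558050, 1.558100, 1.558020, 1.558030),
    ("2015-01-01 17:05:00", 1.558580, 1.558710, 1.557870, 1.557950),
    ("2015-01-01 17:06:00", 1.557910, 1.558120, 1.557910, 1.557990),
    ("2015-01-01 17:07:00", 1.557940, 1.558250, 1.557940, 1.558170),
    ("2015-01-01 17:08:00", 1.558140, 1.558200, 1.558080, 1.558120),
    ("2015-01-01 17:09:00", 1.558110, 1.558140, 1.557970, 1.557970),
]

ANCHOR = datetime(2015, 1, 1)  # assumed fixed reference point for bucketing


def bucket(minute_bar, width=5):
    """Index of the width-minute interval that contains minute_bar."""
    ts = datetime.strptime(minute_bar, "%Y-%m-%d %H:%M:%S")
    elapsed_minutes = int((ts - ANCHOR).total_seconds()) // 60
    return elapsed_minutes // width


def resample(rows, width=5):
    """Collapse minute bars into (MinuteBar, Open, Close, Low, High) bars."""
    bars = []
    ordered = sorted(rows)  # ISO-style timestamps sort chronologically
    for _, group in groupby(ordered, key=lambda r: bucket(r[0], width)):
        group = list(group)
        bars.append((
            group[0][0],               # first MinuteBar in the interval
            group[0][1],               # Open of the first minute
            group[-1][4],              # Close of the last minute
            min(r[3] for r in group),  # lowest Low
            max(r[2] for r in group),  # highest High
        ))
    return bars


for bar in resample(minute_data):
    print(bar)
```

Run against the sample rows, this should yield two bars whose values match the desired output table; the SQL needs to express the same "first, last, min, max per bucket" logic.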

Current Solution

I have a solution that does this (below), but it feels inelegant as it relies on id values and self joins. Also, I intend to run it on much larger datasets so I was looking to do it in a more efficient manner if possible:

    -- Create a column to allow grouping in 5 minute intervals
    SELECT  Id, MinuteBar, [Open], High, Low, [Close],
            DATEDIFF(MINUTE, '2015-01-01T00:00:00', MinuteBar) / 5 AS Interval
    INTO    #5MinuteData
    FROM    #MinuteData
    ORDER BY MinuteBar

    -- Group by interval and aggregate prior to self join
    SELECT  Interval ,
            MIN(MinuteBar) AS MinuteBar ,
            MIN(Id) AS OpenId ,
            MAX(Id) AS CloseId ,
            MIN(Low) AS Low ,
            MAX(High) AS High
    INTO    #DataMinMax
    FROM    #5MinuteData
    GROUP BY Interval;

    -- Self join to get the Open and Close values
    SELECT  t1.Interval ,
            t1.MinuteBar ,
            tOpen.[Open] ,
            tClose.[Close] ,
            t1.Low ,
            t1.High
    FROM    #DataMinMax t1
            INNER JOIN #5MinuteData tOpen ON tOpen.Id = OpenId
            INNER JOIN #5MinuteData tClose ON tClose.Id = CloseId;

    DROP TABLE #DataMinMax
    DROP TABLE #5MinuteData

Rework Attempt

Instead of the above queries, I've been looking at using FIRST_VALUE and LAST_VALUE, as it seems to be what I'm after, but I can't quite get it working with the grouping that I'm doing. There might be a better solution than what I'm trying to do, so I'm open to suggestions. Currently I'm trying to do this:

    SELECT  MIN(MinuteBar) MinuteBar5 ,
            FIRST_VALUE([Open]) OVER (ORDER BY MinuteBar) AS Opening,
            MAX(High) AS High ,
            MIN(Low) AS Low ,
            LAST_VALUE([Close]) OVER (ORDER BY MinuteBar) AS Closing ,
            DATEDIFF(MINUTE, '2015-01-01 00:00:00', MinuteBar) / 5 AS Interval
    FROM    #MinuteData
    GROUP BY DATEDIFF(MINUTE, '2015-01-01 00:00:00', MinuteBar) / 5

This gives me the below error, which is related to the FIRST_VALUE and LAST_VALUE as the query runs if I remove those lines:

Column '#MinuteData.MinuteBar' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Tanner asked Jan 24 '17 23:01




2 Answers

    SELECT
        MIN(MinuteBar) AS MinuteBar5,
        Opening,
        MAX(High) AS High,
        MIN(Low) AS Low,
        Closing,
        Interval
    FROM
    (
        SELECT FIRST_VALUE([Open]) OVER (PARTITION BY DATEDIFF(MINUTE, '2015-01-01 00:00:00', MinuteBar) / 5
                                         ORDER BY MinuteBar) AS Opening,
               FIRST_VALUE([Close]) OVER (PARTITION BY DATEDIFF(MINUTE, '2015-01-01 00:00:00', MinuteBar) / 5
                                          ORDER BY MinuteBar DESC) AS Closing,
               DATEDIFF(MINUTE, '2015-01-01 00:00:00', MinuteBar) / 5 AS Interval,
               *
        FROM #MinuteData
    ) AS T
    GROUP BY Interval, Opening, Closing

This solution stays close to your current attempt. Your query went wrong in two places:

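  1. FIRST_VALUE and LAST_VALUE are analytic (window) functions. They operate on a window or partition defined by the OVER clause rather than on a GROUP BY group, and they return a value for every input row. That is why referencing MinuteBar inside OVER (ORDER BY MinuteBar) raises the error: the column is neither aggregated nor part of the GROUP BY. You can run the nested query on its own to see its result.

  2. LAST_VALUE returns the last value of the current window frame. Your query does not specify a frame, and when ORDER BY is present the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, that is, from the first row of the current partition up to the current row. With unique ordering values, LAST_VALUE therefore just returns the current row's own value. You can either use FIRST_VALUE with a descending order or specify the frame explicitly:

    LAST_VALUE([Close]) OVER (PARTITION BY DATEDIFF(MINUTE, '2015-01-01 00:00:00', MinuteBar) / 5
                              ORDER BY MinuteBar
                              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Closing,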
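The frame behaviour from point 2 can be checked with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for SQL Server, which is an assumption made purely so the example is self-contained: SQLite (3.25 or newer) applies the same default frame, but the table name `bars` and column names here are illustrative, and the Close values are the first five from the sample data.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server (assumption for illustration only);
# window functions require SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bars (minute_bar TEXT, close_price REAL)")
conn.executemany(
    "INSERT INTO bars VALUES (?, ?)",
    [
        ("2015-01-01 17:00:00", 1.557880),
        ("2015-01-01 17:01:00", 1.557880),
        ("2015-01-01 17:02:00", 1.558040),
        ("2015-01-01 17:03:00", 1.558050),
        ("2015-01-01 17:04:00", 1.558030),
    ],
)

rows = conn.execute(
    """
    SELECT close_price,
           -- No frame given: the frame ends at the current row
           LAST_VALUE(close_price) OVER (ORDER BY minute_bar) AS default_frame,
           -- Explicit frame covering the whole partition
           LAST_VALUE(close_price) OVER (
               ORDER BY minute_bar
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS full_frame,
           -- Alternative: reverse the order and take the first value
           FIRST_VALUE(close_price) OVER (ORDER BY minute_bar DESC) AS first_desc
    FROM bars
    ORDER BY minute_bar
    """
).fetchall()

for row in rows:
    print(row)
```

In each printed row, `default_frame` simply echoes that row's own `close_price`, while `full_frame` and `first_desc` both carry the Close of the last minute on every row.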
qxg answered Sep 22 '22 00:09


Here is one way to do it without temporary tables:

    ;WITH CTEInterval AS
    (
        -- This replaces your first temporary table (#5MinuteData)
        SELECT  [Id],
                [MinuteBar],
                [Open],
                [High],
                [Low],
                [Close],
                -- Elapsed minutes since a fixed anchor, as in your original query;
                -- DATEPART(MINUTE, MinuteBar) / 5 would repeat every hour
                DATEDIFF(MINUTE, '2015-01-01T00:00:00', MinuteBar) / 5 AS Interval
        FROM #MinuteData
    ),
    CTEOpenClose AS
    (
        -- This is instead of your second temporary table (#DataMinMax)
        SELECT  [Id],
                [MinuteBar],
                FIRST_VALUE([Open]) OVER (PARTITION BY Interval ORDER BY MinuteBar) AS [Open],
                [High],
                [Low],
                FIRST_VALUE([Close]) OVER (PARTITION BY Interval ORDER BY MinuteBar DESC) AS [Close],
                Interval
        FROM CTEInterval
    )

    -- This is the final select
    SELECT  MIN([MinuteBar]) AS [MinuteBar],
            AVG([Open]) AS [Open],   -- All values of [Open] in the same interval are the same...
            AVG([Close]) AS [Close], -- All values of [Close] in the same interval are the same...
            MIN([Low]) AS [Low],
            MAX([High]) AS [High]
    FROM CTEOpenClose
    GROUP BY Interval

Results:

    MinuteBar                Open       Close       Low         High
    2015-01-01 17:00:00.000  1.557870   1.558030    1.557870    1.558100
    2015-01-01 17:05:00.000  1.558580   1.557970    1.557870    1.558710
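One thing worth checking before running this on larger datasets is that the interval key stays unique across hours and days. Minute-of-hour divided by 5 repeats every hour, whereas elapsed minutes since a fixed anchor divided by 5 does not. The sketch below illustrates the difference using Python's built-in `sqlite3` as a stand-in for SQL Server; the table, the helper `count_buckets`, and the four rows (two of them an hour later than the sample data) are hypothetical, and SQLite's `strftime` replaces `DATEPART` and `DATEDIFF`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bars (minute_bar TEXT)")
conn.executemany(
    "INSERT INTO bars VALUES (?)",
    [
        ("2015-01-01 17:02:00",),
        ("2015-01-01 17:03:00",),
        ("2015-01-01 18:02:00",),  # hypothetical rows one hour later
        ("2015-01-01 18:03:00",),
    ],
)


def count_buckets(bucket_expr):
    """Number of distinct interval keys produced by a bucketing expression."""
    sql = f"SELECT COUNT(DISTINCT {bucket_expr}) FROM bars"
    return conn.execute(sql).fetchone()[0]


# Minute-of-hour / 5: 17:02 and 18:02 land in the same bucket
by_minute_of_hour = count_buckets("CAST(strftime('%M', minute_bar) AS INTEGER) / 5")

# Elapsed seconds since the Unix epoch / 300: one bucket per real 5-minute span
by_elapsed = count_buckets("CAST(strftime('%s', minute_bar) AS INTEGER) / 300")

print(by_minute_of_hour, by_elapsed)
```

The two hours are merged into a single bucket by the minute-of-hour key but kept apart by the elapsed-time key, which is the behaviour a multi-hour dataset needs.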
Zohar Peled answered Sep 22 '22 00:09