This is more of a puzzle, really. It's probably been asked elsewhere before, but I couldn't find anything, so I thought I'd share the question.
I'm trying to implement a form of load balancing in an application and have reduced the problem to what I believe should be a simple T-SQL exercise (the application lives predominantly in SQL Server, specifically SQL Server 2008 R2).
Basically I have a table with two integer columns: a unique, sequential Id and a non-unique Value. The table could hold any number of records, and I'd like to produce a result set in which the n largest Values are split across separate 'groupings', then the next n largest Values are split across those groupings, and so on.
I've got a first draft working below but I believe it can be improved...
DECLARE @GroupCount INT = 5
-- Set up the test data
DECLARE @Test TABLE (Id INT IDENTITY(1, 1), Value INT)
INSERT @Test (Value)
VALUES (100), (456), (121), (402), (253), (872), (765), (6529), (1029), (342), (98), (1), (0), (4), (46), (23), (456), (416), (2323), (4579)
--Order by Value descending
;WITH cte AS
(
SELECT *
,ROW_NUMBER() OVER (ORDER BY Value DESC) RowNum
FROM @Test
)
--use modulus to split into grouping
, cte2 AS
(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY RowNum % @GroupCount ORDER BY RowNum DESC) Rnk
FROM cte
)
SELECT ROW_NUMBER() OVER (PARTITION BY Rnk ORDER BY Value DESC) AS 'Grouping'
,Value
,Id
FROM cte2
ORDER BY [Grouping], Value ASC
This works and produces the following dataset:
Grouping, Value, Id
======== ===== ==
1 46 15
1 342 10
1 765 7
1 6529 8
2 23 16
2 253 5
2 456 2
2 4579 20
3 4 14
3 121 3
3 456 17
3 2323 19
4 1 12
4 100 1
4 416 18
4 1029 9
5 0 13
5 98 11
5 402 4
5 872 6
The dataset returned is correct in that the first n largest Values are split into separate groupings, and so on, but the summed Values differ widely between groupings (compare grouping 1 with grouping 5, for example).
When grouped and SUMmed, we can see the uneven spread:
Grouping, SummedValues
======== ============
1 7682
2 5311
3 2904
4 1546
5 1372
In as few lines as possible how can I better balance the Values so that the total Values in each grouping is more evenly spread?
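For what it's worth, the round-robin split above is easy to reproduce outside SQL; here's a minimal Python sketch of what the query computes (variable names are mine):

```python
# Round-robin ("modulus") split: rank values descending, then deal them
# out into buckets by row number modulo the group count.
values = [100, 456, 121, 402, 253, 872, 765, 6529, 1029, 342,
          98, 1, 0, 4, 46, 23, 456, 416, 2323, 4579]
group_count = 5

sums = [0] * group_count
for row_num, value in enumerate(sorted(values, reverse=True), start=1):
    sums[row_num % group_count] += value

# Same uneven spread as the T-SQL output above.
print(sorted(sums, reverse=True))  # [7682, 5311, 2904, 1546, 1372]
```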
This is flawed, but not terrible given the example data. Your mileage may vary.
declare @groupcount int = 5;
create table t (id int identity(1, 1), value int);
insert t values
(100),(456),(121),(402),(253),(872),(765),(6529),(1029),(342)
, (98),(1),(0),(4),(46),(23),(456),(416),(2323),(4579);
;with cte as (
select *
, rn = row_number() over (order by value asc)
, pct = value/sum(value+.0) over()   -- each row's share of the grand total
, target = 1.0 / @groupcount         -- the ideal share per group
from t
)
, remaining as (
select id, value, rn
-- bucket by how many target-sized slices the running total has crossed
, grp = convert(int,(sum(value) over (order by rn)/sum(value+.0) over())*@groupCount)+1
from cte
)
select
grp = row_number() over (order by sum(value) desc)   -- renumber groups by descending sum
, sumValue = sum(value)
from remaining
group by grp
rextester demo: http://rextester.com/UNV61100
results:
+-----+----------+
| grp | sumValue |
+-----+----------+
| 1 | 6529 |
| 2 | 4579 |
| 3 | 3483 |
| 4 | 2323 |
| 5 | 1901 |
+-----+----------+
The windowed running total above (SUM() OVER (ORDER BY ...)) requires SQL Server 2012 or later; this second version avoids it (useful on the OP's 2008 R2) by using a CROSS JOIN for the grand total and an OUTER APPLY for the running total:
declare @groupcount int = 5;
create table t (id int identity(1, 1), value int);
insert t values
(100),(456),(121),(402),(253),(872),(765),(6529),(1029),(342)
, (98),(1),(0),(4),(46),(23),(456),(416),(2323),(4579);
;with cte as (
select *
, rn = row_number() over (order by value asc)
, pct = value/tv.TotalValue
, target = 1.0 / @groupcount
from t
cross join (select TotalValue = sum(value+.0) from t) tv
)
, remaining as (
select id, value, rn
, grp = convert(int,((x.sumValueOver/TotalValue)*@groupcount)+1)
from cte
outer apply (
select sumValueOver = sum(value)
from cte i
where i.rn <= cte.rn
) x
)
select
grp = row_number() over (order by sum(value) desc)
, sumValue = sum(value)
from remaining
group by grp
rextester demo: http://rextester.com/DEUDJ77007
returns:
+-----+----------+
| grp | sumValue |
+-----+----------+
| 1 | 6529 |
| 2 | 4579 |
| 3 | 3483 |
| 4 | 2323 |
| 5 | 1901 |
+-----+----------+
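Both variants implement the same idea: walk the values in ascending order and start a new group each time the running total crosses another 1/@groupcount slice of the grand total. A rough Python translation (my own sketch, not part of the queries):

```python
# Running-total bucketing: grp = floor(running_sum / total * group_count) + 1,
# computed over the values in ascending order; groups are then ranked by sum.
values = [100, 456, 121, 402, 253, 872, 765, 6529, 1029, 342,
          98, 1, 0, 4, 46, 23, 456, 416, 2323, 4579]
group_count = 5
total = sum(values)

sums = {}
running = 0
for value in sorted(values):
    running += value
    grp = int(running / total * group_count) + 1
    sums[grp] = sums.get(grp, 0) + value

# Renumbered by descending sum, as in the final SELECT.
print(sorted(sums.values(), reverse=True))  # [6529, 4579, 3483, 2323, 1901]
```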
The NTILE function in SQL Server may help you here.
DECLARE @GroupCount INT = 5
-- Set up the test data
DECLARE @test TABLE (Id INT IDENTITY(1, 1), Value INT)
INSERT @Test (Value)
VALUES (100), (456), (121), (402), (253), (872), (765), (6529), (1029), (342), (98), (1), (0), (4), (46), (23), (456), (416), (2323), (4579)
;With cte
AS
(
SELECT *, NTILE(@GroupCount) OVER(ORDER BY Value DESC) AS GroupNo FROM @Test
)
SELECT GroupNo, SUM(Value) AS SummedValues FROM cte
GROUP BY GroupNo
and I get this result:
GroupNo SummedValues
--------------------
1 14460
2 2549
3 1413
4 365
5 28
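NTILE(n) deals the ordered rows into n contiguous groups whose row counts differ by at most one, so it balances counts, not sums. A Python sketch of what the query above computes (my translation, not T-SQL):

```python
# NTILE(n): split the descending-ordered rows into n contiguous tiles of
# (nearly) equal row count, then sum each tile.
values = [100, 456, 121, 402, 253, 872, 765, 6529, 1029, 342,
          98, 1, 0, 4, 46, 23, 456, 416, 2323, 4579]
group_count = 5

ordered = sorted(values, reverse=True)
base, extra = divmod(len(ordered), group_count)  # first `extra` tiles get one extra row

sums, start = [], 0
for tile in range(group_count):
    size = base + (1 if tile < extra else 0)
    sums.append(sum(ordered[start:start + size]))
    start += size

print(sums)  # [14460, 2549, 1413, 365, 28]
```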
A slightly better way to do this would be to "snake" the selections. You're lining up the 1st, 6th, 11th highest - of course that's way higher than 5th, 10th, 15th.
Better would be 1st, 10th, 11th, versus 5th, 6th, 15th. Still not perfect, and with your particular data still very poor, but slightly better than yours.
DECLARE @GroupCount INT = 5
-- Set up the test data
DECLARE @test TABLE (Id INT IDENTITY(1, 1), Value INT)
INSERT @Test (Value)
VALUES (100), (456), (121), (402), (253), (872), (765), (6529), (1029), (342), (98), (1), (0), (4), (46), (23), (456), (416), (2323), (4579)
--Order by Value descending
;WITH cte AS
(
SELECT *
,ROW_NUMBER() OVER (ORDER BY Value DESC) RowNum
FROM @Test
)
--use modulus to split into grouping
, cte2 AS
(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY RowNum % (@GroupCount*2 ) ORDER BY RowNum DESC) Rnk
FROM cte
)
select [Grouping], SUM(value) from (
SELECT floor(abs(@GroupCount - (ROW_NUMBER() OVER (PARTITION BY Rnk ORDER BY Value DESC) - 0.5)) + 0.5) AS 'Grouping'
,Value
,Id
FROM cte2
--ORDER BY [Grouping], Value ASC
) a group by [Grouping]
order by [Grouping] ASC
Ultimately, though, I think random assignment is likely better than this; perhaps random assignment while checking that a group's sum hasn't already reached 2 * (total / group count).
Really, I think this is not a problem well solved by T-SQL or any SQL; languages that can control flow on a row-by-row basis will serve you better: Python, C#, SAS, whatever other tool is in your toolbox. (PL/SQL is the one place I'd consider going here...)
Anything that would let you say, on a row-level basis, "having kept track of what I've assigned so far, assign this particular case to the bucket with the lowest sum so far" really would work better.
Grouping Summed Values
---------------------
1 1781
2 1608
3 2904
4 5249
5 7273
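That row-by-row idea, assign each case to the bucket with the lowest sum so far, is the classic greedy heuristic for number partitioning. A Python sketch, assuming values are taken largest first (names are mine):

```python
import heapq

# Greedy partitioning: take values largest first and drop each one into
# whichever bucket currently has the smallest running total.
values = [100, 456, 121, 402, 253, 872, 765, 6529, 1029, 342,
          98, 1, 0, 4, 46, 23, 456, 416, 2323, 4579]
group_count = 5

buckets = [(0, grp) for grp in range(1, group_count + 1)]  # (running sum, bucket no.)
heapq.heapify(buckets)
for value in sorted(values, reverse=True):
    total, grp = heapq.heappop(buckets)      # bucket with the lowest sum so far
    heapq.heappush(buckets, (total + value, grp))

print(sorted((total for total, _ in buckets), reverse=True))
# [6529, 4579, 2576, 2566, 2565] -- noticeably flatter than the SQL approaches
```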
This uses the NTILE and ROW_NUMBER window functions together: NTILE splits the rows into groups that are even by count (not by sum), and ROW_NUMBER is then used to alternate the direction of assignment so as to even out the total in each group as much as possible.
Answer:
declare @ntile_cnt int = 4; -- with 20 rows, 4 tiles of 5 reproduce the totals below

select case b.grp_split when 1 then b.grp_split_rnk_desc else grp_split_rnk_asc end as [grouping]
, b.value
, b.id
from (
select a.id
, a.value
, a.grp_split
, row_number() over (partition by a.grp_split order by a.value desc) grp_split_rnk_desc
, row_number() over (partition by a.grp_split order by a.value asc) grp_split_rnk_asc
from (
select t.id
, t.value
, ntile(@ntile_cnt) over (order by t.value desc) as grp_split
from @test as t
) as a
) as b
order by case b.grp_split when 1 then b.grp_split_rnk_desc else grp_split_rnk_asc end asc
, b.value asc
Results:
Not perfect, but slightly closer.
Group Total
1 7029
2 5096
3 2904
4 1761
5 2025
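A Python reading of this query (my own sketch, assuming @ntile_cnt = 4, the value that reproduces the posted totals): the first tile hands out its values largest first, every later tile smallest first.

```python
# NTILE + alternating ROW_NUMBER: tile 1 assigns its values to groupings in
# descending order, every other tile in ascending order.
values = [100, 456, 121, 402, 253, 872, 765, 6529, 1029, 342,
          98, 1, 0, 4, 46, 23, 456, 416, 2323, 4579]
ntile_cnt = 4
ordered = sorted(values, reverse=True)
tile_size = len(ordered) // ntile_cnt  # 5 rows per tile -> groupings 1..5

sums = {}
for t in range(ntile_cnt):
    tile = ordered[t * tile_size:(t + 1) * tile_size]
    if t > 0:               # tiles after the first assign smallest value first
        tile = tile[::-1]
    for grouping, value in enumerate(tile, start=1):
        sums[grouping] = sums.get(grouping, 0) + value

print([sums[g] for g in sorted(sums)])  # [7029, 5096, 2904, 1761, 2025]
```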