Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create evenly sized groups based on aggregate

Probably a newbie question, but I'm looking to split our server inventory up into several evenly sized groups based on total database size, and am stumped figuring out how to group them. I think NTILE will work, maybe, but I just can't wrap my head around splitting the groups evenly. My example below is just ordering the servers randomly. I would like the results to be 3 groups of fairly even size (obviously won't be exact).

Using SQL Server 2012. Any help is appreciated. Thanks.

declare @Servers table (ServerName sysname, TotalSizeGB decimal (12,2))
insert into @Servers values
('Server1',123.45),
('Server2',234.56),
('Server3',345.67),
('Server4',456.78),
('Server5',567.89),
('Server6',678.90),
('Server7',789.01),
('Server8',890.12),
('Server9',901.23),
('Server10',1023.35)

select GroupNumber, sum(TotalSizeGB) as TotalSizeGB
from (
     select ServerName, sum(TotalSizeGB) as TotalSizeGB, ntile(3) over (order by newid()) as GroupNumber
     from (
          select ServerName, TotalSizeGB from @Servers
          ) x 
     group by ServerName
     ) y
group by GroupNumber

The expected output here would be three groups of about 2000GB each. I expect it won't be exact, but at least close. If grouping per server, it might look like this:

ServerName  TotalSizeGB GroupNumber
Server10    1023.35 1  
Server1 123.45  1
Server5 567.89  1
Server3 345.67  1
Server4 456.78  2
Server7 789.01  2
Server6 678.90  2
Server2 234.56  3
Server9 901.23  3
Server8 890.12  3

If I was taking a sum per group, it would look like this:

GroupNumber TotalSizeGB
1   2060.36
2   1924.69
3   2025.91
like image 362
FilamentUnities Avatar asked Jul 26 '13 19:07

FilamentUnities


3 Answers

SELECT  *
FROM(
    SELECT  y.TotalSizeGB,
            CASE 
                WHEN y.AnotherGrp%2=0 AND y.PseudoGrpNumber=0 THEN 2
                WHEN y.AnotherGrp%2=0 AND y.PseudoGrpNumber=1 THEN 1
                WHEN y.AnotherGrp%2=0 AND y.PseudoGrpNumber=2 THEN 0
                ELSE y.PseudoGrpNumber
            END GrpNumber
    FROM(
        SELECT 
            x.ServerName,
            x.TotalSizeGB,
            (2+ROW_NUMBER() OVER(ORDER BY x.TotalSizeGB DESC))%3 PseudoGrpNumber,
            (2+ROW_NUMBER() OVER(ORDER BY x.TotalSizeGB DESC))/3 AnotherGrp,
            ROW_NUMBER() OVER(ORDER BY x.TotalSizeGB DESC) RowNum
        FROM    @Servers x
    )y
)z
PIVOT( SUM(z.TotalSizeGB) FOR z.GrpNumber IN([0],[1],[2]) ) pvt;

Results:

0       1       2
------- ------- -------
2048.02 1925.80 2037.14

Some explanations:

The idea is to sort data descending on TotalSizeGB column. Then every 3 sequential rows are grouped together (column AnotherGrp) first in DESC order and then in ASC order (column PseudoGroNumber and GrpNumber). If it's executed SELECT * FROM () y derivate table then the results will be:

ServerName TotalSizeGB  PseudoGrpNumber AnotherGrp GrpNumber RowNum
---------- ------------ --------------- ---------- --------- ------
Server10   1023.35      0               1          0         1
Server9    901.23       1               1          1         2
Server8    890.12       2               1          2         3
Server7    789.01       0               2          2         4
Server6    678.90       1               2          1         5
Server5    567.89       2               2          0         6
Server4    456.78       0               3          0         7
Server3    345.67       1               3          1         8
Server2    234.56       2               3          2         9
Server1    123.45       0               4          2         10
like image 121
Bogdan Sahlean Avatar answered Sep 17 '22 17:09

Bogdan Sahlean


This task is scientific actually (Packing problem, or a kind of), and may be better suits math.stackexchange :)

My solution is in two steps (as many optimization problems are) - find some initial solution and try to refine it.

Initial solution:

ServerName GroupNo     TotalSizeGB
---------- ----------- -----------
Server1    3           123.45
Server2    3           234.56
Server3    2           345.67
Server4    1           456.78
Server5    2           567.89
Server6    1           678.90
Server7    3           789.01
Server8    3           890.12
Server9    1           901.23
Server10   2           1023.35

GroupNo     GroupSizeGb
----------- -----------
1           2036.91
2           1936.91
3           2037.14

Optimized:

ServerName GroupNo     TotalSizeGB
---------- ----------- -----------
Server1    3           123.45
Server2    3           234.56
Server3    2           345.67
Server4    1           456.78
Server5    3           567.89
Server6    1           678.90
Server7    2           789.01
Server8    2           890.12
Server9    1           901.23
Server10   3           1023.35

GroupNo     GroupSizeGb
----------- -----------
1           2036.91
2           2024.80
3           1949.25

Unfortunately, I was not able to set it up on SQLFiddle, because of explicit transactions are used.

set nocount on

-- Parameters
declare
  @nGroups int, -- Number of groups to split servers to
  @tolerance float, -- let say 0.0 ... 0.1 (0.1 mean that (+/-)10% deviation allowed from target group size)
  @nTries int, -- refinement tries 100, 1000, 10000 or as much as you can wait if you are not satisfied with initial solution
  @mFactor float, -- refinement param 0.0 ... 1.0
  @tolerance2 float -- let say 0.1 ... 0.3

set @nGroups = 3
set @tolerance = 0
set @nTries = 1000
set @mFactor = 0.3
set @tolerance2 = 0.3


-- Initial Data
create table #Servers (ID int identity, ServerName sysname, TotalSizeGB decimal (12,2), primary key clustered(ID))

insert into #Servers (ServerName, TotalSizeGB) values
('Server1',123.45),
('Server2',234.56),
('Server3',345.67),
('Server4',456.78),
('Server5',567.89),
('Server6',678.90),
('Server7',789.01),
('Server8',890.12),
('Server9',901.23),
('Server10',1023.35)

create table #Groups (GroupNo int not NULL, primary key clustered (GroupNo))
insert into #Groups (GroupNo)
select N from (select row_number() over (order by @@spid) from sys.all_columns) S(N) where N <= @nGroups

create table #ServerGroups (ServerID int not NULL, GroupNo int not NULL, primary key clustered(ServerID))
create index #IX_GroupServers_GroupNo on #ServerGroups (GroupNo)

declare
    @srvCnt int,
    @grSize decimal (12,2),
    @grNo int,
    @grSz decimal (12,2),
    @srvID int

select @srvCnt = count(1), @grSize = sum(TotalSizeGB) / @nGroups from #Servers
select @grSize as [Target approx. group size]

-- Find initial solution
while (select count(1) from #ServerGroups) < @srvCnt
begin
    select top 1 @grNo = g.GroupNo
    from #Groups g
        left join #ServerGroups sg on sg.GroupNo = g.GroupNo
        left join #Servers s on s.ID = sg.ServerID
    group by g.GroupNo
    order by sum(s.TotalSizeGB)

    select @grSz = IsNull(sum(s.TotalSizeGB), 0)
    from #Groups g
        left join #ServerGroups sg on sg.GroupNo = g.GroupNo
        left join #Servers s on s.ID = sg.ServerID
    where g.GroupNo = @grNo

    select top 1 @srvID = ID
    from #Servers s
    where not exists (select 1 from #ServerGroups where ServerID = s.ID)
    order by abs(@grSize - @grSz - s.TotalSizeGB)

    insert into #ServerGroups (ServerID, GroupNo) values (@srvID, @grNo)
end

select g.GroupNo, SUM(s.TotalSizeGB) GroupSizeGb
from #Groups g
    join #ServerGroups sg on sg.GroupNo = g.GroupNo
    join #Servers s on s.ID = sg.ServerID
group by g.GroupNo


-- Refinement
declare @fTarg float

select @fTarg = sum(abs(case when abs(re) > @tolerance then re else 0 end))
from (
    select g.GroupNo, SUM(s.TotalSizeGB) GroupSizeGb
    from #Groups g
        join #ServerGroups sg on sg.GroupNo = g.GroupNo
        join #Servers s on s.ID = sg.ServerID
    group by g.GroupNo
) t
cross apply (select (GroupSizeGb - @grSize)/@grSize re) p

print @fTarg

if @fTarg > 0
begin

create table #MServerGroups (ServerID int not NULL, GroupNo int not NULL, primary key clustered (ServerID))
insert into #MServerGroups
select ServerID, GroupNo from #ServerGroups

while @nTries > 0
begin
    set @nTries = @nTries - 1

    begin transaction

    ;with MS as (
        select top (100*@mFactor) percent ServerID, GroupNo
        from #MServerGroups
        order by checksum(newid())
    )
    update msg
    set
        msg.GroupNo = case when msg.ServerID = tt.ServerID1 then tt.NewNo1 else tt.NewNo2 end
    from
        #MServerGroups msg
        join (
            select ServerID1, NewNo1, ServerID2, NewNo2
            from (
                select MS.ServerID as ServerID1, SS.GroupNo as NewNo1, SS.ServerID as ServerID2, MS.GroupNo as NewNo2, row_number() over (partition by SS.ServerID order by @@spid) as rn
                from MS
                    join #Servers s on s.ID = MS.ServerID
                    cross apply (
                        select top 1 *
                        from
                            #Servers s2
                            join #MServerGroups ms2 on ms2.ServerID = s2.ID
                        where
                            s2.ID != MS.ServerID and ms2.GroupNo != MS.GroupNo and abs(s2.TotalSizeGB - s.TotalSizeGB)/s.TotalSizeGB < @tolerance2
                        order by checksum(newid())
                    ) SS
            ) t
            where rn = 1
        )tt on msg.ServerID in (tt.ServerID1, tt.ServerID2)

    if @@rowcount = 0
    begin
        rollback transaction
        continue;
    end

    declare @fT float

    select @fT = sum(abs(case when abs(re) > @tolerance then re else 0 end))
    from (
        select g.GroupNo, SUM(s.TotalSizeGB) GroupSizeGb
        from #Groups g
            join #MServerGroups sg on sg.GroupNo = g.GroupNo
            join #Servers s on s.ID = sg.ServerID
        group by g.GroupNo
    ) t
    cross apply (select (GroupSizeGb - @grSize)/@grSize re) p

    if @fT < @fTarg
    begin
        set @fTarg = @ft
        print @fTarg -- the less this number, the better solution is

        commit transaction
    end
    else
        rollback transaction
end

update s
set s.GroupNo = m.GroupNo
from #MServerGroups m
    join #ServerGroups s on s.ServerID = m.ServerID

select g.GroupNo, SUM(s.TotalSizeGB) GroupSizeGb
from #Groups g
    join #ServerGroups sg on sg.GroupNo = g.GroupNo
    join #Servers s on s.ID = sg.ServerID
group by g.GroupNo

drop table #MServerGroups

end
else
    print 'No refinement needed'

drop table #Groups
drop table #ServerGroups
drop table #Servers

I suggest to start with @nTries = 0 and reasonable @tolerance (e.g. 0.1, 0.05).

like image 25
i-one Avatar answered Sep 16 '22 17:09

i-one


Here's a solution which generates the same results as @i-one's code but perhaps a little easier to understand (at least for me). I use 'chunk' instead of 'group' to avoid keyword conflicts.

The premise is as follows. To create n evenly-sized chunks:

  1. Sort all records in decreasing order
  2. Assign the first n records to their chunks by row number
  3. Loop through the remainder, always assigning to the smallest chunk

I've uploaded the code to SQLFiddle but it doesn't seem to like the table variable. Here's the link anyways.

-- Source data:
DECLARE @Servers TABLE (ServerName SYSNAME, TotalSizeGB DECIMAL (12,2))
INSERT INTO @Servers VALUES
('Server1',123.45),
('Server2',234.56),
('Server3',345.67),
('Server4',456.78),
('Server5',567.89),
('Server6',678.90),
('Server7',789.01),
('Server8',890.12),
('Server9',901.23),
('Server10',1023.35)


-- Solution start
DECLARE @ServersChunked TABLE (
        ServerName SYSNAME,
        TotalSizeGB DECIMAL (12,2),
        RowNum INT,
        ChunkNo INT
    );
DECLARE
    @ChunkCount INT = 3,
    @MinRowNum INT,
    @SmallestChunk INT;


-- Copy table into variable (skip this if the original table can be amended to include the RowNum and ChunkNo fields)
INSERT INTO @ServersChunked
SELECT 
    *, 
    RowNum = ROW_NUMBER() OVER (ORDER BY TotalSizeGB DESC), 
    ChunkNo = NULL
FROM @Servers

-- Assign the initial chunks to largest tables
UPDATE @ServersChunked
SET ChunkNo = RowNum
WHERE RowNum <= @ChunkCount


-- Assign chunks to remaining tables
WHILE EXISTS (SELECT 1 FROM @ServersChunked WHERE ChunkNo IS NULL) BEGIN

    -- Find the next table (by descending row count)
    SELECT @MinRowNum = MIN(RowNum) FROM @ServersChunked WHERE ChunkNo IS NULL

    -- Find the smallest chunk
    SELECT TOP 1 @SmallestChunk = ChunkNo
    FROM @ServersChunked
    WHERE ChunkNo IS NOT NULL
    GROUP BY ChunkNo
    ORDER BY Sum(TotalSizeGB) ASC

    -- Assign the table to the chunk
    UPDATE @ServersChunked
    SET ChunkNo = @SmallestChunk
    WHERE RowNum = @MinRowNum
END

Here are the results:

ChunkNo SumTotalSizeGB
1       1936.91
2       2036.91
3       2037.14
like image 39
maccaroo Avatar answered Sep 18 '22 17:09

maccaroo