Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I normalize the capitalization of a group-by column?

On SQL Server configured as case-insensitive, group by can have interesting results when the [n][var]char column is not the first group by column. Essentially, it looks like whatever row it encounters "first" (where "first" is undefined in the absence of an order): wins for that grouping. For example:

select x.[day], x.[name], count(1) as [count]
from (
    select 1 as [day], 'a' as [name]
    union all select 1, 'A'
    union all select 2, 'A'
    union all select 2, 'a'
    ) x group by x.[day], x.[name]

which returns, for me:

day         name count
----------- ---- -----------
1           A    2
2           a    2

Using min(x.[name]) has no effect, since the grouping already happened.

I can't add an order by before the group by, as that is illegal; and adding the order by after the group by just defines the output order after the grouping - it still gives a and A.

So: is there a sane way of doing this where the capitalization will at least be consistent for all the groupings? (I'll leave for another day the problem of being consistent for separate runs)

Desired output, either:

day         name count
----------- ---- -----------
1           A    2
2           A    2

or:

day         name count
----------- ---- -----------
1           a    2
2           a    2

Edit: without destroying capitalisation when it is consistent between groups. So no upper/lower. So if one of the groups consistently has the value BcDeF, I want the result of that row to be BcDeF, not bcdef or BCDEF.

like image 722
Marc Gravell Avatar asked Sep 22 '16 11:09

Marc Gravell


1 Answers

I would use windowing functions for this. By using ROW_NUMBER and partitioning using a case insensitive collation, but ordering by a case sensitive one, we will choose consistently one result with the original capitalisation, but it will group them as if they are the same:

WITH CTE AS
(
    SELECT  *,
            RN = ROW_NUMBER() OVER(PARTITION BY [day], [name]
                                   ORDER BY [name] COLLATE SQL_Latin1_General_Cp1_Cs_AS),
            N = COUNT(*) OVER(PARTITION BY [day], [name])
    FROM (  select 1 as [day], 'a' as [name]
            union all select 1, 'A'
            union all select 2, 'A'
            union all select 2, 'a'
            union all select 3, 'BcDeF'
            union all select 3, 'bCdEf') X
)
SELECT *
FROM CTE
WHERE RN = 1;

It returns:

╔═════╦═══════╦════╦═══╗
║ day ║ name  ║ RN ║ N ║
╠═════╬═══════╬════╬═══╣
║   1 ║ A     ║  1 ║ 2 ║
║   2 ║ A     ║  1 ║ 2 ║
║   3 ║ BcDeF ║  1 ║ 2 ║
╚═════╩═══════╩════╩═══╝

Following @AndriyM's comment, if you want the same capitalisation over the whole result set, and not just the same day, you can use:

WITH CTE AS
(
    SELECT  *,
            RN = ROW_NUMBER() OVER(PARTITION BY [day], [name]
                                   ORDER BY [name] COLLATE SQL_Latin1_General_Cp1_Cs_AS),
            N = COUNT(*) OVER(PARTITION BY [day], [name])
    FROM (  select 1 as [day], 'a' as [name]
            union all select 1, 'A'
            union all select 2, 'A'
            union all select 2, 'a'
            union all select 3, 'BcDeF'
            union all select 3, 'bCdEf') X
)
SELECT  [day],
        MAX([name] COLLATE SQL_Latin1_General_Cp1_CS_AS) OVER (PARTITION BY [name]) [name],
        N
FROM CTE
WHERE RN = 1;
like image 90
Lamak Avatar answered Sep 28 '22 02:09

Lamak