
Grouping and counting rows by value until it changes

I have a table where messages are stored as they happen. Usually there is a message 'A', and sometimes the A's are separated by a single message 'B'. Now I want to group the values so that I can analyze them, for example to find the longest 'A'-streak or the distribution of 'A'-streaks.

I already tried a COUNT-OVER query, but that keeps on counting across all messages instead of restarting when the message changes.

SELECT message, COUNT(*) OVER (ORDER BY Timestamp RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

This is my example data:

Timestamp        Message
20150329 00:00   A
20150329 00:01   A
20150329 00:02   B
20150329 00:03   A
20150329 00:04   A
20150329 00:05   A
20150329 00:06   B

I want the following output:

Message    COUNT
A          2
B          1
A          3
B          1
asked Mar 29 '15 09:03 by dwonisch



2 Answers

That was interesting :)

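;WITH cte AS (
    -- gn numbers rows within each message; rn numbers rows across the whole set
    SELECT Messages.Message, Timestamp,
           ROW_NUMBER() OVER (PARTITION BY Message ORDER BY Timestamp) AS gn,
           ROW_NUMBER() OVER (ORDER BY Timestamp) AS rn
    FROM Messages
), cte2 AS (
    -- gb stays constant within a run of consecutive identical messages
    SELECT Message, Timestamp, gn, rn, gn - rn AS gb
    FROM cte
), cte3 AS (
    -- one row per streak: its start time and its length
    SELECT Message, MIN(Timestamp) AS Ts, COUNT(1) AS Cnt
    FROM cte2
    GROUP BY Message, gb
)
SELECT Message, Cnt
FROM cte3
ORDER BY Ts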

Here is the result set:

Message   Cnt
A         2
B         1
A         3
B         1

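The query could be shorter, but I posted it this way so you can see what's happening. The result is exactly as requested. The most important part is gn - rn. The idea is to number the rows within each message partition and, at the same time, number the rows in the whole set. Subtracting one from the other gives a value that stays constant within each run of consecutive identical messages, so grouping by Message together with that difference yields one group per streak. The intermediate values look like this: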

;WITH cte AS (
    SELECT Messages.Message, Timestamp,
           ROW_NUMBER() OVER (PARTITION BY Message ORDER BY Timestamp) AS gn,
           ROW_NUMBER() OVER (ORDER BY Timestamp) AS rn
    FROM Messages
), cte2 AS (
    SELECT Message, Timestamp, gn, rn, gn - rn AS gb
    FROM cte
)
SELECT * FROM cte2

Message  Timestamp                gn  rn  gb
A        2015-03-29 00:00:00.000  1   1   0
A        2015-03-29 00:01:00.000  2   2   0
B        2015-03-29 00:02:00.000  1   3   -2
A        2015-03-29 00:03:00.000  3   4   -1
A        2015-03-29 00:04:00.000  4   5   -1
A        2015-03-29 00:05:00.000  5   6   -1
B        2015-03-29 00:06:00.000  2   7   -5
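
The question also mentions finding the longest 'A'-streak and the distribution of 'A'-streaks. A rough sketch of how the gn - rn grouping could feed that analysis is shown here. It is an assumption-laden illustration, not part of the original answer: it runs through Python's sqlite3 module against an in-memory SQLite database (3.25 or later for window functions) rather than SQL Server, and the CTE name streaks and the variables longest_a and distribution are made up for the example.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server here; table and column names
# follow the answer above, the sample rows follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Messages (Timestamp TEXT, Message TEXT)")
conn.executemany(
    "INSERT INTO Messages VALUES (?, ?)",
    [
        ("2015-03-29 00:00", "A"),
        ("2015-03-29 00:01", "A"),
        ("2015-03-29 00:02", "B"),
        ("2015-03-29 00:03", "A"),
        ("2015-03-29 00:04", "A"),
        ("2015-03-29 00:05", "A"),
        ("2015-03-29 00:06", "B"),
    ],
)

# Shared prefix: one row per streak, keyed by Message and gn - rn.
STREAKS = """
WITH cte AS (
    SELECT Message, Timestamp,
           ROW_NUMBER() OVER (PARTITION BY Message ORDER BY Timestamp) AS gn,
           ROW_NUMBER() OVER (ORDER BY Timestamp) AS rn
    FROM Messages
), streaks AS (
    SELECT Message, MIN(Timestamp) AS Ts, COUNT(*) AS Cnt
    FROM cte
    GROUP BY Message, gn - rn
)
"""

# Length of the longest 'A'-streak.
longest_a = conn.execute(
    STREAKS + "SELECT MAX(Cnt) FROM streaks WHERE Message = 'A'"
).fetchone()[0]

# Distribution: number of 'A'-streaks per streak length.
distribution = conn.execute(
    STREAKS
    + "SELECT Cnt, COUNT(*) FROM streaks WHERE Message = 'A' "
    + "GROUP BY Cnt ORDER BY Cnt"
).fetchall()

print(longest_a)     # 3
print(distribution)  # [(2, 1), (3, 1)]
```

The same two final SELECT statements should carry over to SQL Server if appended to the CTE chain above, but that has not been verified here.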
answered Nov 06 '22 21:11 by Mihail Shishkov


Here is a slightly shorter solution:

DECLARE @t TABLE ( d DATE, m CHAR(1) )

INSERT  INTO @t
VALUES  ( '20150301', 'A' ),
        ( '20150302', 'A' ),
        ( '20150303', 'B' ),
        ( '20150304', 'A' ),
        ( '20150305', 'A' ),
        ( '20150306', 'A' ),
        ( '20150307', 'B' );

WITH
c1 AS (SELECT d, m, IIF(LAG(m, 1, m) OVER (ORDER BY d) = m, 0, 1) AS n FROM @t),
c2 AS (SELECT m, SUM(n) OVER (ORDER BY d) AS n FROM c1)
SELECT m, COUNT(*) AS c
FROM c2
GROUP BY m, n
ORDER BY n  -- without this the row order of the result is not guaranteed

Output:

m   c
A   2
B   1
A   3
B   1

The idea is to put the value 1 on every row where the message differs from the previous row, and 0 everywhere else:

2015-03-01  A   0
2015-03-02  A   0
2015-03-03  B   1
2015-03-04  A   1
2015-03-05  A   0
2015-03-06  A   0
2015-03-07  B   1

The second step is a running total: each row's value plus the values of all preceding rows:

2015-03-01  A   0
2015-03-02  A   0
2015-03-03  B   1
2015-03-04  A   2
2015-03-05  A   2
2015-03-06  A   2
2015-03-07  B   3

This way every streak ends up with its own combination of message and running total, and grouping by those two columns gives one row per streak.
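
For readers who find the window functions hard to follow, the three steps can be mirrored procedurally. The sketch below is only an illustration in plain Python, not a replacement for the SQL; the variable names (flags, groups, result) are invented for the example, and the rows are assumed to be already sorted by date.

```python
from itertools import accumulate

# Sample data from the answer above, already ordered by date.
rows = [
    ("2015-03-01", "A"),
    ("2015-03-02", "A"),
    ("2015-03-03", "B"),
    ("2015-03-04", "A"),
    ("2015-03-05", "A"),
    ("2015-03-06", "A"),
    ("2015-03-07", "B"),
]
messages = [m for _, m in rows]

# Step 1: 1 where the message differs from the previous row, else 0.
# The first row has no predecessor and gets 0, like LAG(m, 1, m).
flags = [
    int(i > 0 and messages[i] != messages[i - 1])
    for i in range(len(messages))
]

# Step 2: running total of the flags, like SUM(n) OVER (ORDER BY d).
groups = list(accumulate(flags))

# Step 3: count rows per (running total, message), like GROUP BY m, n.
counts = {}
for m, g in zip(messages, groups):
    counts[(g, m)] = counts.get((g, m), 0) + 1

# Sort by running total so the streaks come out in chronological order.
result = [(m, c) for (g, m), c in sorted(counts.items())]

print(flags)   # [0, 0, 1, 1, 0, 0, 1]
print(groups)  # [0, 0, 1, 2, 2, 2, 3]
print(result)  # [('A', 2), ('B', 1), ('A', 3), ('B', 1)]
```

The flags and groups lists match the two intermediate tables shown above, which makes it easy to check each step by hand.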

answered Nov 06 '22 22:11 by Giorgi Nakeuri