RedShift GROUP BY Constant Column Gives Inconsistent Results

Question

I was wondering if anyone could help explain why using GROUP BY on a constant column that is VARCHAR vs. one that is INTEGER gives different behaviours.

Here is my minimal working example. The following table mocks the real-world data for which I found this problem off of:

CREATE TABLE test.show_bug AS

WITH integers AS (
    SELECT 0 AS num
    UNION SELECT 1 AS num
    UNION SELECT 2 AS num
    UNION SELECT 3 AS num
    UNION SELECT 4 AS num
    UNION SELECT 5 AS num
)
SELECT 
    '2017-03-16' + mod(a.num, 2) AS date_time 
    , CASE mod(b.num, 3)
        WHEN 0 THEN 'source_a'
        WHEN 1 THEN 'source_b'
        WHEN 2 THEN 'source_c'
        END AS user_source
    , b.num || a.num || b.num || a.num || b.num AS user_id
FROM integers AS a
CROSS JOIN integers AS b
;

This looks like:

 date_time  | user_source | user_id
------------+-------------+---------
 2017-03-17 | source_a    | 3113313
 2017-03-17 | source_b    | 4114414
 2017-03-17 | source_b    | 1111111
 2017-03-16 | source_a    | 0000000
 2017-03-16 | source_c    | 2442242
 2017-03-16 | source_c    | 5225525
....
(36 rows)

Essentially I want to be able to COUNT (for each day) the number of users, number of sources, and the amount of users per source. However, I have two different tables of the same format from which I want to UNION the results together. I can differentiate these results by adding a constant column to each: app_1 and app_2.

For the sake of example I'm using the same mock table twice, but in real-world application I have two different tables, either way the following sql should get my desired result:

SELECT 
    'app_1' AS app
    , date_time
    , COUNT(user_source)
    , COUNT(DISTINCT user_source)
    , COUNT(DISTINCT user_id)
FROM test.show_bug
GROUP BY 1, 2

UNION

SELECT 
    'app_2' AS app
    , date_time
    , COUNT(user_source)
    , COUNT(DISTINCT user_source)
    , COUNT(DISTINCT user_id)
FROM test.show_bug
GROUP BY 1, 2

This results in

   app    |     date_trunc      | count | count1 | count2
----------+---------------------+-------+--------+--------
 app_1    | 2017-03-16 00:00:00 |     2 |      1 |      0
 app_1    | 2017-03-17 00:00:00 |     2 |      1 |      0
 app_1    | 2017-03-19 00:00:00 |     5 |      0 |      1
 app_2    | 2017-03-19 00:00:00 |     7 |      1 |      0
 app_1    | 2017-03-16 00:00:00 |     0 |      1 |      0
....
(112 rows)

which is not correct because what I'd actually expect is the result attained from using the integer values 1 and 2 in place of the VARCHAR values app_1 and app_2, i.e. like:

SELECT 
    1 AS app
    , date_time
    , COUNT(user_source)
    , COUNT(DISTINCT user_source)
    , COUNT(DISTINCT user_id)
FROM test.show_bug
GROUP BY 1, 2

UNION

SELECT 
    2 AS app
    , date_time
    , COUNT(user_source)
    , COUNT(DISTINCT user_source)
    , COUNT(DISTINCT user_id)
FROM test.show_bug
GROUP BY 1, 2

which gives me:

   app    |     date_trunc      | count | count1 | count2
----------+---------------------+-------+--------+--------
        1 | 2017-03-16 00:00:00 |   192 |     16 |    192
        1 | 2017-03-17 00:00:00 |   208 |     14 |    208
        1 | 2017-03-18 00:00:00 |   203 |     14 |    203
        1 | 2017-03-19 00:00:00 |   203 |     14 |    203
        1 | 2017-03-20 00:00:00 |    35 |      0 |     35
        2 | 2017-03-16 00:00:00 |   192 |     16 |    192
        2 | 2017-03-17 00:00:00 |   208 |     14 |    208
        2 | 2017-03-18 00:00:00 |   203 |     14 |    203
        2 | 2017-03-19 00:00:00 |   203 |     14 |    203
        2 | 2017-03-20 00:00:00 |    35 |      0 |     35

This effect is also seen if I don't use the UNION.

There are some obvious workarounds to get the result I desire, however the fundamental issue here is that there seems to be an unintuitive difference in behaviour with the use of VARCHAR constant columns instead of INTEGER constant columns.

If someone could help me understand what this difference is I would be greatly appreciative.

John Rotenstein · Accepted Answer

I'd say you found a genuine bug in Amazon Redshift, or at least a behaviour that is undesired.

I managed to narrow it down to:

Only when a VARCHAR is used as one of the GROUP BY, and
When using multiple COUNT(DISTINCT) statements

So, this simple statement also produces too many results:

SELECT 
    '1',
    COUNT(DISTINCT user_source),
    COUNT(DISTINCT user_id)
FROM show_bug
GROUP BY 1

Yet this is okay:

SELECT 
    '1'::INTEGER,
    COUNT(DISTINCT user_source),
    COUNT(DISTINCT user_id)
FROM show_bug
GROUP BY 1

Removing either of the COUNT(DISTINCT) entries also makes it work fine.

If you are subscribed to AWS Support, I recommend that you lodge a bug report. If you are not subscribed to support, you can lodge it via the AWS Support Forums, but they will not have guaranteed response times.

RedShift GROUP BY Constant Column Gives Inconsistent Results

Tags:

sql

amazon-redshift

Eric Jalbert

1 Answers

John Rotenstein

Recent Activity

Donate For Us

RedShift GROUP BY Constant Column Gives Inconsistent Results

Tags:

sql

amazon-redshift

Eric Jalbert

1 Answers

John Rotenstein

Related questions

Recent Activity

Donate For Us