I was wondering if anyone could help explain why using GROUP BY
on a constant column that is VARCHAR
vs. one that is INTEGER
gives different behaviours.
Here is my minimal working example. The following table mocks the real-world data for which I found this problem off of:
CREATE TABLE test.show_bug AS
WITH integers AS (
SELECT 0 AS num
UNION SELECT 1 AS num
UNION SELECT 2 AS num
UNION SELECT 3 AS num
UNION SELECT 4 AS num
UNION SELECT 5 AS num
)
SELECT
'2017-03-16' + mod(a.num, 2) AS date_time
, CASE mod(b.num, 3)
WHEN 0 THEN 'source_a'
WHEN 1 THEN 'source_b'
WHEN 2 THEN 'source_c'
END AS user_source
, b.num || a.num || b.num || a.num || b.num AS user_id
FROM integers AS a
CROSS JOIN integers AS b
;
This looks like:
date_time | user_source | user_id
------------+-------------+---------
2017-03-17 | source_a | 3113313
2017-03-17 | source_b | 4114414
2017-03-17 | source_b | 1111111
2017-03-16 | source_a | 0000000
2017-03-16 | source_c | 2442242
2017-03-16 | source_c | 5225525
....
(36 rows)
Essentially I want to be able to COUNT
(for each day) the number of users, number of sources, and the amount of users per source. However, I have two different tables of the same format from which I want to UNION
the results together. I can differentiate these results by adding a constant column to each: app_1
and app_2
.
For the sake of example I'm using the same mock table twice, but in real-world application I have two different tables, either way the following sql should get my desired result:
SELECT
'app_1' AS app
, date_time
, COUNT(user_source)
, COUNT(DISTINCT user_source)
, COUNT(DISTINCT user_id)
FROM test.show_bug
GROUP BY 1, 2
UNION
SELECT
'app_2' AS app
, date_time
, COUNT(user_source)
, COUNT(DISTINCT user_source)
, COUNT(DISTINCT user_id)
FROM test.show_bug
GROUP BY 1, 2
This results in
app | date_trunc | count | count1 | count2
----------+---------------------+-------+--------+--------
app_1 | 2017-03-16 00:00:00 | 2 | 1 | 0
app_1 | 2017-03-17 00:00:00 | 2 | 1 | 0
app_1 | 2017-03-19 00:00:00 | 5 | 0 | 1
app_2 | 2017-03-19 00:00:00 | 7 | 1 | 0
app_1 | 2017-03-16 00:00:00 | 0 | 1 | 0
....
(112 rows)
which is not correct because what I'd actually expect is the result attained from using the integer values 1
and 2
in place of the VARCHAR
values app_1
and app_2
, i.e. like:
SELECT
1 AS app
, date_time
, COUNT(user_source)
, COUNT(DISTINCT user_source)
, COUNT(DISTINCT user_id)
FROM test.show_bug
GROUP BY 1, 2
UNION
SELECT
2 AS app
, date_time
, COUNT(user_source)
, COUNT(DISTINCT user_source)
, COUNT(DISTINCT user_id)
FROM test.show_bug
GROUP BY 1, 2
which gives me:
app | date_trunc | count | count1 | count2
----------+---------------------+-------+--------+--------
1 | 2017-03-16 00:00:00 | 192 | 16 | 192
1 | 2017-03-17 00:00:00 | 208 | 14 | 208
1 | 2017-03-18 00:00:00 | 203 | 14 | 203
1 | 2017-03-19 00:00:00 | 203 | 14 | 203
1 | 2017-03-20 00:00:00 | 35 | 0 | 35
2 | 2017-03-16 00:00:00 | 192 | 16 | 192
2 | 2017-03-17 00:00:00 | 208 | 14 | 208
2 | 2017-03-18 00:00:00 | 203 | 14 | 203
2 | 2017-03-19 00:00:00 | 203 | 14 | 203
2 | 2017-03-20 00:00:00 | 35 | 0 | 35
This effect is also seen if I don't use the UNION
.
There are some obvious workarounds to get the result I desire, however the fundamental issue here is that there seems to be an unintuitive difference in behaviour with the use of VARCHAR
constant columns instead of INTEGER
constant columns.
If someone could help me understand what this difference is I would be greatly appreciative.
I'd say you found a genuine bug in Amazon Redshift, or at least a behaviour that is undesired.
I managed to narrow it down to:
VARCHAR
is used as one of the GROUP BY, andCOUNT(DISTINCT)
statementsSo, this simple statement also produces too many results:
SELECT
'1',
COUNT(DISTINCT user_source),
COUNT(DISTINCT user_id)
FROM show_bug
GROUP BY 1
Yet this is okay:
SELECT
'1'::INTEGER,
COUNT(DISTINCT user_source),
COUNT(DISTINCT user_id)
FROM show_bug
GROUP BY 1
Removing either of the COUNT(DISTINCT)
entries also makes it work fine.
If you are subscribed to AWS Support, I recommend that you lodge a bug report. If you are not subscribed to support, you can lodge it via the AWS Support Forums, but they will not have guaranteed response times.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With