When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour. The symptom is that performing COUNT(DISTINCT)
returns the same value as COUNT()
:
My theory is that it has something to do with the aggregation on the data occurring after the count has already happened. If I attempt the exact same aggregation on the same data in an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce:
My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com
) with the following custom query:
SELECT
userid,
submissionid,
date
FROM mytable
If I stop using the default date
field for the Date Dimension and aggregate my own dates directly in within the SQL query (date_byweek
), the COUNT(DISTINCT)
aggregation works as expected:
SELECT
userid,
submissionid,
to_char(date,'YYYY-IW') as date_byweek
FROM mytable
While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.). Not to mention reducing my confidence at what else may be "buggy" within the product 😞
If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:
> SELECT * FROM mytable
userid submissionid
-------- -------------
1 1
2 2
1 3
1 4
3 5
> COUNT(DISTINCT userid) -- ERROR: Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
The COUNT_DISTINCT function takes 1 parameter, which can be the name of a metric, dimension, or expression of any type. COUNT_DISTINCT returns the total number of unique items in that field or expression. To count all items, including duplicates, use COUNT .
You can't change the Auto aggregation method. No aggregation is applied. The field is treated as a dimension, even if it contains numeric data. Note: This method only appears in the data source.
You can treat any dimension as a metric in your charts: simply drag the field from the list to the metrics section of the chart configuration, then select the type of aggregation to apply. Dimensions containing non-numeric data (e.g. text, date) will be aggregated using Count Distinct (CTD).
A dimension in Data Studio is usually text, geographic coordinates, a Boolean choice, a date or time or a URL. A metric can be a number, a percentage, a duration or an amount of currency. Metrics are further broken down by how they're aggregated.
I'm happy to report that as of Sep 17 2020, there's a workaround.
DataStudio added the DATETIME_TRUNC
function (see here https://support.google.com/datastudio/answer/9729685?), that allows you to add a custom field that truncs the original date to whatever granularity you want, without causing the distinct bug.
Attempting to set the display granularity in the report still causes the bug (i.e., you'll still set Oct 1 2020 12:00:00 instead of Oct 2020).
This can be solved by creating a SECOND custom field, which just returns the first, and then you can add IT to the report, change the display granularity, and everything will work OK.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With