Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Inaccurate COUNT DISTINCT Aggregation with Date dimension in Google Data Studio

When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour. The symptom is that performing COUNT(DISTINCT) returns the same value as COUNT():

incorrect count value for userid when connector is postgres

My theory is that it has something to do with the aggregation on the data occurring after the count has already happened. If I attempt the exact same aggregation on the same data in an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce:

correct count value for userid when connector is a csv file

My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com) with the following custom query:

SELECT
  userid,
  submissionid,
  date
FROM mytable

Workaround

If I stop using the default date field for the Date Dimension and aggregate my own dates directly in within the SQL query (date_byweek), the COUNT(DISTINCT) aggregation works as expected:

SELECT
  userid,
  submissionid,
  to_char(date,'YYYY-IW') as date_byweek
FROM mytable

While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.). Not to mention reducing my confidence at what else may be "buggy" within the product 😞


How to Reproduce

If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:

> SELECT * FROM mytable
  userid  submissionid
-------- -------------
       1             1
       2             2
       1             3
       1             4
       3             5

> COUNT(DISTINCT userid) -- ERROR:    Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
like image 446
Paulo Avatar asked Jun 03 '19 17:06

Paulo


People also ask

How do I count unique values in a Google Data Studio?

The COUNT_DISTINCT function takes 1 parameter, which can be the name of a metric, dimension, or expression of any type. COUNT_DISTINCT returns the total number of unique items in that field or expression. To count all items, including duplicates, use COUNT .

How do I change auto aggregation in Data Studio?

You can't change the Auto aggregation method. No aggregation is applied. The field is treated as a dimension, even if it contains numeric data. Note: This method only appears in the data source.

How do I change a dimension to metrics in Google Studio data?

You can treat any dimension as a metric in your charts: simply drag the field from the list to the metrics section of the chart configuration, then select the type of aggregation to apply. Dimensions containing non-numeric data (e.g. text, date) will be aggregated using Count Distinct (CTD).

What is metric and dimension in Data Studio?

A dimension in Data Studio is usually text, geographic coordinates, a Boolean choice, a date or time or a URL. A metric can be a number, a percentage, a duration or an amount of currency. Metrics are further broken down by how they're aggregated.


1 Answers

I'm happy to report that as of Sep 17 2020, there's a workaround.

DataStudio added the DATETIME_TRUNC function (see here https://support.google.com/datastudio/answer/9729685?), that allows you to add a custom field that truncs the original date to whatever granularity you want, without causing the distinct bug.

Attempting to set the display granularity in the report still causes the bug (i.e., you'll still set Oct 1 2020 12:00:00 instead of Oct 2020).

This can be solved by creating a SECOND custom field, which just returns the first, and then you can add IT to the report, change the display granularity, and everything will work OK.

like image 150
yassa Avatar answered Oct 16 '22 02:10

yassa