Inaccurate COUNT DISTINCT Aggregation with Date dimension in Google Data Studio

Tags: amazon-redshift, google-data-studio

When I aggregate values in Google Data Studio with a date dimension on a PostgreSQL Connector, I see buggy behaviour. The symptom is that performing COUNT(DISTINCT) returns the same value as COUNT():

[Image: incorrect count value for userid when connector is postgres] https://i.stack.imgur.com/Mt6Ve.png

My theory is that the count is computed before the data is aggregated to the date granularity, so the date aggregation happens too late to remove duplicates. If I attempt the exact same aggregation on the same data in an exported CSV instead of directly from a PostgreSQL Connector Data Source, the issue does not reproduce:

[Image: correct count value for userid when connector is a csv file] https://i.stack.imgur.com/66qfS.png

My PostgreSQL Connector is connecting to Amazon Redshift (jdbc:postgresql://*******.eu-west-1.redshift.amazonaws.com) with the following custom query:

SELECT
  userid,
  submissionid,
  date
FROM mytable

Workaround

If I stop using the default date field for the Date Dimension and aggregate my own dates directly within the SQL query (date_byweek), the COUNT(DISTINCT) aggregation works as expected:

SELECT
  userid,
  submissionid,
  to_char(date,'YYYY-IW') as date_byweek
FROM mytable

While this workaround solves my immediate problem, it sucks because I miss out on all the date functionality provided by Data Studio (Hierarchy Drill Down, Date Range filtering, etc.). Not to mention that it reduces my confidence about what else may be "buggy" within the product 😞
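One caveat on the date_byweek format above: in PostgreSQL's to_char, IW is the ISO week number, while YYYY is the calendar year, so dates near a year boundary can get a label that mixes the two. The ISO week-numbering year pattern is IYYY; whether your Redshift version accepts it is worth checking in its documentation. A minimal Python sketch of the difference, where both helper functions are hypothetical and only mimic the two format strings:

```python
from datetime import date


def week_label_calendar_year(d: date) -> str:
    """Mimics 'YYYY-IW': calendar year plus ISO week number."""
    return f"{d.year}-{d.isocalendar()[1]:02d}"


def week_label_iso_year(d: date) -> str:
    """Mimics 'IYYY-IW': ISO week-numbering year plus ISO week number."""
    iso_year, iso_week, _ = d.isocalendar()
    return f"{iso_year}-{iso_week:02d}"


# 1 January 2021 belongs to ISO week 53 of 2020.
new_years_day = date(2021, 1, 1)

print(week_label_calendar_year(new_years_day))  # 2021-53
print(week_label_iso_year(new_years_day))       # 2020-53
```

With the calendar-year label, the first days of January 2021 land in a bucket called "2021-53", which sorts at the end of 2021 rather than next to the last week of 2020.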

How to Reproduce

If you'd like to re-create the issue, using the following data as a PostgreSQL Data Source should suffice:

> SELECT * FROM mytable
  userid  submissionid
-------- -------------
       1             1
       2             2
       1             3
       1             4
       3             5

> COUNT(DISTINCT userid) -- ERROR:    Returns 5 when data source is PostgreSQL
> COUNT(DISTINCT userid) -- EXPECTED: Returns 3 when data source is CSV (exported from same PostgreSQL query above)
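The numbers above can be checked outside Data Studio. The sketch below loads the same userid/submissionid pairs into an in-memory SQLite database (standing in for Redshift) and adds a date column with invented timestamps, one per row, since the sample table does not list any. It also illustrates the theory from earlier in the question, as an assumption rather than confirmed connector behaviour: if the distinct count is taken per raw timestamp and the results are then summed, the total equals the plain row count.

```python
import sqlite3

# userid/submissionid pairs come from the table above; the timestamps are
# hypothetical, chosen so that every row has its own distinct value.
ROWS = [
    (1, 1, "2019-06-03 09:00:00"),
    (2, 2, "2019-06-03 10:30:00"),
    (1, 3, "2019-06-03 11:15:00"),
    (1, 4, "2019-06-04 08:45:00"),
    (3, 5, "2019-06-04 17:20:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (userid INTEGER, submissionid INTEGER, date TEXT)"
)
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", ROWS)

# What the report should show: distinct users across all rows.
expected = conn.execute(
    "SELECT COUNT(DISTINCT userid) FROM mytable"
).fetchone()[0]

# Plain row count, which is what the PostgreSQL-backed chart displays.
row_count = conn.execute("SELECT COUNT(userid) FROM mytable").fetchone()[0]

# The theory: distinct count per raw timestamp first, summed afterwards.
summed_per_timestamp = conn.execute(
    """
    SELECT SUM(users)
    FROM (
        SELECT date, COUNT(DISTINCT userid) AS users
        FROM mytable
        GROUP BY date
    )
    """
).fetchone()[0]

print(expected, row_count, summed_per_timestamp)  # 3 5 5
```

If the theory holds, the PostgreSQL connector would be doing something like the last query, which would also explain why pre-bucketing the dates in SQL makes the problem go away.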


asked Jun 03 '19 17:06

Paulo

1 Answer

I'm happy to report that as of Sep 17 2020, there's a workaround.

Data Studio added the DATETIME_TRUNC function (see here https://support.google.com/datastudio/answer/9729685?), which lets you add a custom field that truncates the original date to whatever granularity you want, without causing the distinct bug.

Attempting to set the display granularity in the report still causes the bug (i.e., you'll still see Oct 1 2020 12:00:00 instead of Oct 2020).

This can be solved by creating a SECOND custom field, which just returns the first, and then you can add IT to the report, change the display granularity, and everything will work OK.
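To see why truncating first helps, here is a rough sketch of the same idea outside Data Studio. It uses an in-memory SQLite database with invented rows, and SQLite's strftime stands in for DATETIME_TRUNC at month granularity; none of this is Data Studio syntax. In Data Studio the first custom field would be a formula along the lines of DATETIME_TRUNC(date, MONTH), but check the linked documentation for the exact arguments.

```python
import sqlite3

# Hypothetical data: userids and timestamps are invented for illustration.
ROWS = [
    (1, "2020-10-01 12:00:00"),
    (1, "2020-10-15 08:30:00"),
    (2, "2020-10-20 19:05:00"),
    (1, "2020-11-02 07:45:00"),
    (3, "2020-11-02 09:10:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (userid INTEGER, date TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", ROWS)

# Truncate every timestamp to the first day of its month BEFORE grouping,
# so the distinct count is taken per month rather than per raw timestamp.
distinct_users_by_month = conn.execute(
    """
    SELECT strftime('%Y-%m-01', date) AS month_start,
           COUNT(DISTINCT userid) AS users
    FROM mytable
    GROUP BY month_start
    ORDER BY month_start
    """
).fetchall()

print(distinct_users_by_month)  # [('2020-10-01', 2), ('2020-11-01', 2)]
```

User 1 appears twice in October but is counted once, which is the result the truncated custom field is meant to give in the report.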


answered Oct 16 '22 02:10

yassa
