
BigQuery COUNT(DISTINCT value) vs COUNT(value)

I think I found a glitch/bug in BigQuery. We have a table based on bank statistics data at starschema.net:clouddb:bank.Banks_token

If I run the following query:

SELECT count(*) as totalrow, count(DISTINCT BankId ) as bankidcnt FROM bank.Banks_token; 

I get the following result:

Row  totalrow  bankidcnt
1    9513      9903

My problem is this: the table has only 9513 rows, so how can the distinct count be 9903, which is 390 more than the total row count?

Balazs Gunics asked May 17 '13 12:05




2 Answers

In BigQuery (legacy SQL), COUNT(DISTINCT ...) returns a statistical approximation once the number of distinct values exceeds 1000.

You can provide an optional second argument to set the threshold at which approximations are used. So if you use COUNT(DISTINCT BankId, 10000) in your example, you should see the exact result (since the actual number of rows is less than 10000). Note, however, that using a larger threshold can be costly in terms of performance.
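As a rough sketch, this is the query from the question with the threshold argument added (legacy SQL; 10000 is just a value picked to exceed the 9513 rows reported above):

```sql
-- Legacy SQL: the optional second argument to COUNT(DISTINCT ...) raises
-- the threshold at which BigQuery switches to an approximation.
SELECT
  COUNT(*) AS totalrow,
  COUNT(DISTINCT BankId, 10000) AS bankidcnt
FROM bank.Banks_token;
```

With the threshold above the table's row count, bankidcnt should no longer exceed totalrow.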

See the complete documentation here: https://developers.google.com/bigquery/docs/query-reference#aggfunctions


UPDATE 2017:

With BigQuery #standardSQL, COUNT(DISTINCT) is always exact. For approximate results use APPROX_COUNT_DISTINCT(). Why would anyone use approximate results? See this article.
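A minimal sketch of the two side by side in standard SQL, assuming the same table is reachable as bank.Banks_token from the default project (the column aliases are made up for illustration):

```sql
#standardSQL
-- COUNT(DISTINCT ...) is exact here; APPROX_COUNT_DISTINCT trades
-- accuracy for speed and may differ slightly from the exact value.
SELECT
  COUNT(*) AS totalrow,
  COUNT(DISTINCT BankId) AS exact_bankidcnt,
  APPROX_COUNT_DISTINCT(BankId) AS approx_bankidcnt
FROM bank.Banks_token;
```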

Jeremy Condit answered Oct 20 '22 03:10



I've used EXACT_COUNT_DISTINCT() as a way to get the exact unique count. It's cleaner and more general than COUNT(DISTINCT value, n) with n > numRows, because you don't need to know the row count in advance.

Found here: https://cloud.google.com/bigquery/query-reference#aggfunctions
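Applied to the table from the question, that would look roughly like this (legacy SQL only; a sketch, not tested against the original dataset):

```sql
-- Legacy SQL: EXACT_COUNT_DISTINCT always returns the exact number of
-- distinct non-NULL values, with no threshold to tune.
SELECT
  COUNT(*) AS totalrow,
  EXACT_COUNT_DISTINCT(BankId) AS bankidcnt
FROM bank.Banks_token;
```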

smntx answered Oct 20 '22 03:10
