Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery: compute entropy of a column

I have a suggestion for the BQ folks: I think it would be very useful if there was a built-in function that would return the entropy of a column. A column of discrete categories or values would be relatively easy. Thoughts? Does this already exist but I didn't find it?

like image 650
SheRey Avatar asked Oct 27 '25 02:10

SheRey


1 Answers

The simple solution is below - it computes number of distinct values in the column, and then takes logarithm on base 2 - this gives the number of bits needed to encode all different values, which is the column entropy.

SELECT LOG2(COUNT(DISTINCT column)) FROM Table

However, this doesn't take into account the fact that different values have different probabilities. Shannon entropy formula is -SUM(P(xi)*log(P(xi)) where P(xi) is probability of value xi. Here is an example how to compute that in BigQuery, Shannon entropy for column year in natality table:

select -sum(p*log2(p)) from (
select ratio_to_report(c) over() p from (
select year, count(*) c from publicdata:samples.natality group by 1))

UPDATE In case the column variable is not discrete type (i.e. FLOAT), it is possible to discretize the values. Example below shows one approach - first it finds the maximum and minimum values, computes the range, and then puts all the FLOAT values (weight_pound column in natality table) into 100 buckets. After that - problem is reduced to the entropy of INTEGER values.

select discrete_weight, count(*) from (
select 
  cast((weight_pounds - min_weight) * 100 / range_weight as integer)
    as discrete_weight 
from [publicdata:samples.natality] a cross join 
(select 
  min(weight_pounds) as min_weight, 
  max(weight_pounds) - min(weight_pounds) as range_weight 
from [publicdata:samples.natality]) b) group by 1
like image 50
Mosha Pasumansky Avatar answered Oct 29 '25 08:10

Mosha Pasumansky



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!