BigQuery has some statistical aggregation functions such as STDDEV(X) and CORR(X, Y), but it doesn't offer functions to directly perform linear regression. How can one compute a linear regression using the functions that do exist?


How to perform linear regression in BigQuery?

1 Answer

Editor's edit: Please see the next answer; linear regression is now natively supported in BigQuery. --Fh

The following query performs a linear regression using calculations that are numerically stable and easy to adapt to any input table. It produces the slope and intercept of the best fit to the model Y = SLOPE * X + INTERCEPT, along with the Pearson correlation coefficient from the built-in function CORR.
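For reference, the ordinary least-squares fit of this model has a standard closed form, and it is what the query in this answer evaluates. The symbols are notation introduced here for illustration, not taken from the original answer: r_XY is the Pearson correlation (CORR), sigma_X and sigma_Y are the standard deviations, and bars denote means.

```latex
\begin{aligned}
\text{SLOPE} &= r_{XY}\,\frac{\sigma_Y}{\sigma_X}, \\
\text{INTERCEPT} &= \bar{Y} - \text{SLOPE}\cdot\bar{X}
  = \frac{\sum Y - \text{SLOPE}\cdot\sum X}{N}.
\end{aligned}
```

Only the ratio of the two standard deviations matters, so population and sample standard deviations give the same slope as long as both are of the same kind.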

As an example, we use the public natality dataset to compute birth weight as a linear function of the duration of pregnancy, broken down by state. You could write this more compactly, but we use several layers of subqueries to highlight how the pieces go together. To apply this to another dataset, you just need to replace the innermost query.

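-- Outer layer: derive the intercept from the slope and the per-bucket sums.
SELECT Bucket,
       SLOPE,
       (SUM_OF_Y - SLOPE * SUM_OF_X) / N AS INTERCEPT,
       CORRELATION
FROM (
    -- Middle layer: slope = correlation * stddev(Y) / stddev(X).
    SELECT Bucket,
           N,
           SUM_OF_X,
           SUM_OF_Y,
           CORRELATION * STDDEV_OF_Y / STDDEV_OF_X AS SLOPE,
           CORRELATION
    FROM (
        -- Inner aggregate: per-bucket count, sums, standard deviations and correlation.
        SELECT Bucket,
               COUNT(*) AS N,
               SUM(X) AS SUM_OF_X,
               SUM(Y) AS SUM_OF_Y,
               STDDEV_POP(X) AS STDDEV_OF_X,
               STDDEV_POP(Y) AS STDDEV_OF_Y,
               CORR(X,Y) AS CORRELATION
        -- Innermost query: replace this to run the regression on other data.
        FROM (SELECT state AS Bucket,
                     gestation_weeks AS X,
                     weight_pounds AS Y
              FROM [publicdata.samples.natality])
        -- Drop incomplete rows so that N and the sums cover the same rows.
        WHERE Bucket IS NOT NULL AND
              X IS NOT NULL AND
              Y IS NOT NULL
        GROUP BY Bucket));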
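The query above uses legacy SQL table syntax (square brackets). As a rough sketch, the same calculation might look like this in BigQuery standard SQL. The table path `bigquery-public-data.samples.natality`, the CTE names (points, stats, fit), and the use of SAFE_DIVIDE to return NULL when X has zero spread are choices made here for illustration, not part of the original answer.

```sql
-- Sketch only: a standard SQL port of the query above (assumed, not from the original answer).
WITH points AS (
  -- Replace this CTE to run the regression on other data.
  SELECT state AS bucket,
         gestation_weeks AS x,
         weight_pounds AS y
  FROM `bigquery-public-data.samples.natality`
  WHERE state IS NOT NULL
    AND gestation_weeks IS NOT NULL
    AND weight_pounds IS NOT NULL
),
stats AS (
  -- Per-bucket count, sums, standard deviations and correlation.
  SELECT bucket,
         COUNT(*) AS n,
         SUM(x) AS sum_of_x,
         SUM(y) AS sum_of_y,
         STDDEV_POP(x) AS stddev_of_x,
         STDDEV_POP(y) AS stddev_of_y,
         CORR(x, y) AS correlation
  FROM points
  GROUP BY bucket
),
fit AS (
  -- slope = correlation * stddev(y) / stddev(x); NULL if stddev(x) is 0.
  SELECT bucket, n, sum_of_x, sum_of_y, correlation,
         correlation * SAFE_DIVIDE(stddev_of_y, stddev_of_x) AS slope
  FROM stats
)
SELECT bucket,
       slope,
       (sum_of_y - slope * sum_of_x) / n AS intercept,
       correlation
FROM fit
ORDER BY bucket;
```

The three CTEs mirror the three nested subqueries of the legacy version, so adapting it to another dataset again means changing only the first one.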

Using the STDDEV_POP and CORR functions improves the numerical stability of this query compared to summing up products of X and Y and then taking differences and dividing, because subtracting large, nearly equal sums can lose precision. If you use both approaches on a well-behaved dataset, however, you can verify that they produce the same results to high accuracy.
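For comparison, the sums-of-products formulation mentioned above could be sketched as follows, in the same legacy SQL dialect as the answer. The column names SUM_OF_XY and SUM_OF_XX are introduced here for illustration, and this is only one way to write it under those assumptions.

```sql
-- Sketch only: the less stable "sums of products" formulation, for comparison.
SELECT Bucket,
       SLOPE,
       (SUM_OF_Y - SLOPE * SUM_OF_X) / N AS INTERCEPT
FROM (
    -- slope = (N * sum(XY) - sum(X) * sum(Y)) / (N * sum(XX) - sum(X)^2)
    SELECT Bucket,
           N,
           SUM_OF_X,
           SUM_OF_Y,
           (N * SUM_OF_XY - SUM_OF_X * SUM_OF_Y) /
           (N * SUM_OF_XX - SUM_OF_X * SUM_OF_X) AS SLOPE
    FROM (
        SELECT Bucket,
               COUNT(*) AS N,
               SUM(X) AS SUM_OF_X,
               SUM(Y) AS SUM_OF_Y,
               SUM(X * Y) AS SUM_OF_XY,
               SUM(X * X) AS SUM_OF_XX
        FROM (SELECT state AS Bucket,
                     gestation_weeks AS X,
                     weight_pounds AS Y
              FROM [publicdata.samples.natality])
        WHERE Bucket IS NOT NULL AND
              X IS NOT NULL AND
              Y IS NOT NULL
        GROUP BY Bucket));
```

Running both queries and comparing SLOPE and INTERCEPT per Bucket is one way to carry out the check described above. The two differences in the slope expression are where precision can be lost when the sums are large.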

127

answered Sep 16 '22 13:09

sprocket


Tags:

google-bigquery
