Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to perform linear regression in BigQuery?

BigQuery has some statistical aggregation functions such as STDDEV(X) and CORR(X, Y), but it doesn't offer functions to directly perform linear regression.

How can one compute a linear regression using the functions that do exist?

like image 492
sprocket Avatar asked Sep 13 '16 04:09

sprocket


People also ask

Can you do linear regression in SQL?

Simple Linear Regression is handy for the SQL Programmer in making a prediction of a linear trend and giving a figure for the level probability for the prediction, and what is more, they are easy to do with the aggregation that is built into SQL.

How do you make a model in BigQuery?

To create a model in BigQuery, use the BigQuery ML CREATE MODEL statement. This statement is similar to the CREATE TABLE DDL statement. When you run a standard SQL query that contains a CREATE MODEL statement, a query job is generated for you that processes the query.

Is BigQuery ml good?

Advantages of BigQuery ML There is no need to program an ML solution using Python or Java. Models are trained and accessed in BigQuery using SQL—a language data analysts know. BigQuery ML increases the speed of model development and innovation by removing the need to export data from the data warehouse.


1 Answers

Editor's edit: Please see next answer, linear regression is now natively supported in BigQuery. --Fh


The following query performs a linear regression using calculations that are numerically stable and easily modified to work over any input table. It produces the slope and intercept of the best fit to the model Y = SLOPE * X + INTERCEPT and the Pearson correlation coefficient using the builtin function CORR.

As an example, we use the public natality dataset to compute birth weight as a linear function of the duration of pregnancy, broken down by state. You could write this more compactly, but we use several layers of subqueries to highlight how the pieces go together. To apply this to another dataset, you just need to replace the innermost query.

SELECT Bucket,
       SLOPE,
       (SUM_OF_Y - SLOPE * SUM_OF_X) / N AS INTERCEPT,
       CORRELATION
FROM (
    SELECT Bucket,
           N,
           SUM_OF_X,
           SUM_OF_Y,
           CORRELATION * STDDEV_OF_Y / STDDEV_OF_X AS SLOPE,
           CORRELATION
    FROM (
        SELECT Bucket,
               COUNT(*) AS N,
               SUM(X) AS SUM_OF_X,
               SUM(Y) AS SUM_OF_Y,
               STDDEV_POP(X) AS STDDEV_OF_X,
               STDDEV_POP(Y) AS STDDEV_OF_Y,
               CORR(X,Y) AS CORRELATION
        FROM (SELECT state AS Bucket,
                     gestation_weeks AS X,
                     weight_pounds AS Y
              FROM [publicdata.samples.natality])
        WHERE Bucket IS NOT NULL AND
              X IS NOT NULL AND
              Y IS NOT NULL
        GROUP BY Bucket));

Using the STDDEV_POP and CORR functions improves the numerical stability of this query compared to summing up products of X and Y and then taking differences and dividing, but if you use both approaches on a well-behaved dataset, you can verify that they produce the same results to high accuracy.

like image 127
sprocket Avatar answered Sep 16 '22 13:09

sprocket