Are there any linear regression functions in SQL Server 2005/2008, similar to the linear regression functions in Oracle?
Simple linear regression is handy for the SQL programmer for predicting a linear trend and attaching a level of probability to that prediction and, what is more, it is easy to do with the aggregate functions that are built into SQL.
You could calculate R squared by hand and create a variable 'R2' equal to (N*xysum - xsum*ysum)^2 / ((N*x2sum - xsum*xsum) * (N*y2sum - ysum*ysum)), where xsum and ysum are the sums of your x and y values, xysum is the sum of x*y, x2sum and y2sum are the sums of the squared x and y values, and N is the number of observations. You can apply the same logic in T-SQL.
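As an illustration only, here is a minimal T-SQL sketch of that formula; dbo.SomeTable and its columns x and y are placeholder names, not anything from the question:

WITH sums AS
(
    SELECT CAST(COUNT(*) AS FLOAT) AS n,   -- N, the number of observations
           SUM(x)     AS xsum,
           SUM(y)     AS ysum,
           SUM(x * y) AS xysum,
           SUM(x * x) AS x2sum,
           SUM(y * y) AS y2sum
    FROM dbo.SomeTable                     -- hypothetical table with FLOAT columns x and y
)
SELECT SQUARE(n * xysum - xsum * ysum)
       / NULLIF((n * x2sum - xsum * xsum) * (n * y2sum - ysum * ysum), 0) AS R2
FROM sums;

The NULLIF guards against a zero denominator when x or y has no variance, in which case R2 comes back as NULL.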
We have added 5 new functions in the latest release of XLeratorDB / statistics 2008. Four of these functions, LOGIT, LOGITSUM, LOGITPRED, and LOGITPROB, are SQL Server implementations of logistic regression, often referred to as logit regression.
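For context, and independent of any particular product: a logit model relates the probability p of the outcome to a linear predictor through the log-odds,

log(p / (1 - p)) = a + b*X, which is equivalent to p = 1 / (1 + exp(-(a + b*X))).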
Linear regression is a type of regression that models the relationship between a target variable and one or more independent variables with a straight line. The equation of simple linear regression is Y = a + b*X + e, where a is the intercept, b is the slope, and e is the error term.
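As a rough sketch (not a full implementation), a and b can be estimated with nothing but built-in aggregates; dbo.SomeTable and its FLOAT columns X and Y are placeholder names:

WITH s AS
(
    SELECT AVG(X)     AS xbar,   -- mean of X
           AVG(Y)     AS ybar,   -- mean of Y
           AVG(X * Y) AS xybar,  -- mean of X*Y
           AVG(X * X) AS x2bar   -- mean of X^2
    FROM dbo.SomeTable
)
SELECT (xybar - xbar * ybar) / NULLIF(x2bar - xbar * xbar, 0)               AS b,  -- slope = cov(X, Y) / var(X)
       ybar - (xybar - xbar * ybar) / NULLIF(x2bar - xbar * xbar, 0) * xbar AS a   -- intercept = mean(Y) - b * mean(X)
FROM s;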
To the best of my knowledge, there is none. Writing one is pretty straightforward, though. The following gives you the constant alpha and slope beta for y = Alpha + Beta * x + epsilon:
-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
(
    SELECT 1,  1,  1    UNION
    SELECT 1,  2,  2    UNION
    SELECT 1,  3,  1.3  UNION
    SELECT 1,  4,  3.75 UNION
    SELECT 1,  5,  2.25 UNION
    SELECT 2, 95, 85    UNION
    SELECT 2, 85, 95    UNION
    SELECT 2, 80, 70    UNION
    SELECT 2, 70, 65    UNION
    SELECT 2, 60, 70    UNION
    SELECT 3,  1,  2    UNION
    SELECT 3,  1,  3    UNION
    SELECT 4,  1,  2    UNION
    SELECT 4,  2,  2
),
-- linear regression query
/*WITH*/ mean_estimates AS
(
    SELECT GroupID
          ,AVG(x * 1.) AS xmean
          ,AVG(y * 1.) AS ymean
    FROM some_table
    GROUP BY GroupID
),
stdev_estimates AS
(
    SELECT pd.GroupID
          -- T-SQL STDEV() implementation is not numerically stable
          ,CASE SUM(SQUARE(x - xmean)) WHEN 0 THEN 1
                ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
          ,SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1))               AS ystdev
    FROM some_table pd
    INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
    GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS -- increases numerical stability
(
    SELECT pd.GroupID
          ,(x - xmean) / xstdev AS xstd
          ,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
    FROM some_table pd
    INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
    INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
(
    SELECT GroupID
          ,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
                ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END AS betastd
    FROM standardized_data pd
    GROUP BY GroupID
)
SELECT pb.GroupID
      ,ymean - xmean * betastd * ystdev / xstdev AS Alpha
      ,betastd * ystdev / xstdev                 AS Beta
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pb.GroupID
Here GroupID is used to show how to group by some value in your source data table. If you just want the statistics across all data in the table (not specific sub-groups), you can drop it and the joins. I have used the WITH statement for the sake of clarity; as an alternative, you can use sub-queries instead. Please be mindful of the precision of the data type used in your tables, as numerical stability can deteriorate quickly if the precision is not high enough relative to your data.
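For example (a sketch only; dbo.YourSourceTable is a hypothetical table name), you can cast the source columns to FLOAT in the very first CTE so that all of the intermediate arithmetic runs in double precision:

WITH some_table(GroupID, x, y) AS
(
    SELECT GroupID,
           CAST(x AS FLOAT),
           CAST(y AS FLOAT)
    FROM dbo.YourSourceTable
)
SELECT GroupID, x, y
FROM some_table;   -- in practice, the mean/stdev/beta CTEs from the query above would follow here instead of this SELECT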
EDIT: (in answer to Peter's question in the comments asking for additional statistics such as R2)
You can easily calculate additional statistics using the same technique. Here is a version with R2, correlation, and sample covariance:
-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
(
    SELECT 1,  1,  1    UNION
    SELECT 1,  2,  2    UNION
    SELECT 1,  3,  1.3  UNION
    SELECT 1,  4,  3.75 UNION
    SELECT 1,  5,  2.25 UNION
    SELECT 2, 95, 85    UNION
    SELECT 2, 85, 95    UNION
    SELECT 2, 80, 70    UNION
    SELECT 2, 70, 65    UNION
    SELECT 2, 60, 70    UNION
    SELECT 3,  1,  2    UNION
    SELECT 3,  1,  3    UNION
    SELECT 4,  1,  2    UNION
    SELECT 4,  2,  2
),
-- linear regression query
/*WITH*/ mean_estimates AS
(
    SELECT GroupID
          ,AVG(x * 1.) AS xmean
          ,AVG(y * 1.) AS ymean
    FROM some_table pd
    GROUP BY GroupID
),
stdev_estimates AS
(
    SELECT pd.GroupID
          -- T-SQL STDEV() implementation is not numerically stable
          ,CASE SUM(SQUARE(x - xmean)) WHEN 0 THEN 1
                ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
          ,SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1))               AS ystdev
    FROM some_table pd
    INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
    GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS -- increases numerical stability
(
    SELECT pd.GroupID
          ,(x - xmean) / xstdev AS xstd
          ,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
    FROM some_table pd
    INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
    INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
(
    SELECT GroupID
          ,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
                ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END AS betastd
    FROM standardized_data
    GROUP BY GroupID
)
SELECT pb.GroupID
      ,ymean - xmean * betastd * ystdev / xstdev            AS Alpha
      ,betastd * ystdev / xstdev                            AS Beta
      ,CASE ystdev WHEN 0 THEN 1 ELSE betastd * betastd END AS R2
      ,betastd                                              AS Correl
      ,betastd * xstdev * ystdev                            AS Covar
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pb.GroupID
EDIT 2 improves numerical stability by standardizing the data (instead of only centering it) and by replacing STDEV because of numerical stability issues. To me, the current implementation seems to be the best trade-off between stability and complexity. I could improve stability further by replacing my standard deviation with a numerically stable online algorithm, but this would complicate the implementation substantially (and slow it down). Similarly, implementations using e.g. Kahan(-Babuška-Neumaier) compensation for the SUM and AVG seem to perform modestly better in limited tests, but make the query much more complex. And as long as I do not know how T-SQL implements SUM and AVG (e.g. it might already be using pairwise summation), I cannot guarantee that such modifications always improve accuracy.
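To illustrate the extra complexity (a sketch only, not part of the implementation above; dbo.SomeTable with a FLOAT column x is a placeholder): a Kahan-compensated sum is order-dependent, so it cannot be written as a plain set-based SUM() and needs procedural code such as a cursor or a CLR aggregate:

DECLARE @sum FLOAT, @comp FLOAT, @val FLOAT, @y FLOAT, @t FLOAT;
SELECT @sum = 0, @comp = 0;

DECLARE val_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT x FROM dbo.SomeTable;

OPEN val_cursor;
FETCH NEXT FROM val_cursor INTO @val;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @y    = @val - @comp;       -- correct the incoming value by the accumulated error
    SET @t    = @sum + @y;          -- the running total swallows the low-order bits of @y ...
    SET @comp = (@t - @sum) - @y;   -- ... which are recovered here as the new compensation term
    SET @sum  = @t;
    FETCH NEXT FROM val_cursor INTO @val;
END;
CLOSE val_cursor;
DEALLOCATE val_cursor;

SELECT @sum AS CompensatedSum;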