Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Need SQL Server Query to solve 3rd Order Polynomial Regression

Can anyone help with some SQL query code to provide estimates of the co-efficients for a 3rd order Polynomial regression?

Please assume that I have a table of X and Y data values and want to estimate a, b and c in:

Y(X) = aX + bX^2 + cX^3 + E 
like image 878
user1211104 Avatar asked Feb 15 '12 11:02

user1211104


2 Answers

APPROXIMATE but fast solution would be to sample 4 representative points from the data and solve the polynomial equation for these points.

  1. As for the sampling, you can split the data into equal sectors and compute average of X and Y for each sector - the split can be done using quartiles of X-values, averages of X-values, min(x)+(max(x)-min(x))/4 or whatever you think is the most appropriate.

    To illustrate the sampling by quartiles (i.e. by row numbers): illustration of solving 3rd order polynomial by sampling 4 points

  2. As for the solving, i used numberempire.com to solve these* equations for variables k,a,b,c:

    k + a*X1 + b*X1^2 + c*X1^3 - Y1 = 0,
    k + a*X2 + b*X2^2 + c*X2^3 - Y2 = 0,
    k + a*X3 + b*X3^2 + c*X3^3 - Y3 = 0,
    k + a*X4 + b*X4^2 + c*X4^3 - Y4 = 0
    

    *Since Y(X) = 0 + ax bx^2 + cx^3 + ϵ implicitly includes [0, 0] point as one of the sample points, it would create bad approximations for data sets that don't include [0, 0]. I took the liberty of solving Y(X) = k + ax bx^2 + cx^3 + ϵ instead.

The actual SQL would go like this:

select
    -- returns 1 row with columns labeled K, A, B and C = coefficients in 3rd order polynomial equation for the 4 sample points
    -(X1*(X2p2*(X3p3*Y4-X4p3*Y3)+X2p3*(X4p2*Y3-X3p2*Y4)+(X3p2*X4p3-X3p3*X4p2)*Y2)+X1p2*(X2*(X4p3*Y3-X3p3*Y4)+X2p3*(X3*Y4-X4*Y3)+(X3p3*X4-X3*X4p3)*Y2)+X1p3*(X2*(X3p2*Y4-X4p2*Y3)+X2p2*(X4*Y3-X3*Y4)+(X3*X4p2-X3p2*X4)*Y2)+(X2*(X3p3*X4p2-X3p2*X4p3)+X2p2*(X3*X4p3-X3p3*X4)+X2p3*(X3p2*X4-X3*X4p2))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4))  as k,
    (X1p2*(X2p3*(Y4-Y3)-X3p3*Y4+X4p3*Y3+(X3p3-X4p3)*Y2)+X2p2*(X3p3*Y4-X4p3*Y3)+X1p3*(X3p2*Y4+X2p2*(Y3-Y4)-X4p2*Y3+(X4p2-X3p2)*Y2)+X2p3*(X4p2*Y3-X3p2*Y4)+(X3p2*X4p3-X3p3*X4p2)*Y2+(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4))  as a,
    -(X1*(X2p3*(Y4-Y3)-X3p3*Y4+X4p3*Y3+(X3p3-X4p3)*Y2)+X2*(X3p3*Y4-X4p3*Y3)+X1p3*(X3*Y4+X2*(Y3-Y4)-X4*Y3+(X4-X3)*Y2)+X2p3*(X4*Y3-X3*Y4)+(X3*X4p3-X3p3*X4)*Y2+(X2*(X4p3-X3p3)-X3*X4p3+X3p3*X4+X2p3*(X3-X4))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4))  as b,
    (X1*(X2p2*(Y4-Y3)-X3p2*Y4+X4p2*Y3+(X3p2-X4p2)*Y2)+X2*(X3p2*Y4-X4p2*Y3)+X1p2*(X3*Y4+X2*(Y3-Y4)-X4*Y3+(X4-X3)*Y2)+X2p2*(X4*Y3-X3*Y4)+(X3*X4p2-X3p2*X4)*Y2+(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))*Y1)/(X1*(X2p2*(X4p3-X3p3)-X3p2*X4p3+X3p3*X4p2+X2p3*(X3p2-X4p2))+X2*(X3p2*X4p3-X3p3*X4p2)+X1p2*(X3*X4p3+X2*(X3p3-X4p3)+X2p3*(X4-X3)-X3p3*X4)+X2p2*(X3p3*X4-X3*X4p3)+X1p3*(X2*(X4p2-X3p2)-X3*X4p2+X3p2*X4+X2p2*(X3-X4))+X2p3*(X3*X4p2-X3p2*X4))  as c
  from (select
      samples.*,
      -- precomputing the powers should give better performance (at least i hope it would)
      power(X1,2) X1p2, power(X2,2) X2p2, power(X3,2) X3p2, power(X4,2) X4p2,
      power(Y1,3) Y1p3, power(Y2,3) Y2p3, power(Y3,3) Y3p3, power(Y4,3) Y4p3
    from (select
        avg(case when sector = 1 then x end) X1,
        avg(case when sector = 2 then x end) X2,
        avg(case when sector = 3 then x end) X3,
        avg(case when sector = 4 then x end) X4,
        avg(case when sector = 1 then y end) Y1,
        avg(case when sector = 2 then y end) Y2,
        avg(case when sector = 3 then y end) Y3,
        avg(case when sector = 4 then y end) Y4
      from (select x, y, 
          -- splitting to sectors 1 - 4 by row number (SQL Server version)
          ceiling(row_number() OVER (ORDER BY x asc) / count(*) * 4) sector
        from original_data
      )
    ) samples
  )

According to developer.mimer.com, these optional features need to be enabled in SQL Server:

T611, "Elementary OLAP operations"
F591, "Derived tables"
like image 188
Aprillion Avatar answered Sep 18 '22 13:09

Aprillion


SQL Server has a built-in ranking function NTILE(n) which will more easily create your sectors. I replaced:

ceiling(row_number() OVER (ORDER BY x asc) / count(*) * 4) sector

with:

NTILE(4) OVER(ORDER BY x ASC) [sector]

I also needed to add several "precomputed powers" to allow for the full column range as selected. The full list appears below:

POWER(samples.X1, 2) AS [X1p2], 
POWER(samples.X1, 3) AS [X1p3], 
POWER(samples.X2, 2) AS [X2p2], 
POWER(samples.X2, 3) AS [X2p3],
POWER(samples.X3, 2) AS [X3p2], 
POWER(samples.X3, 3) AS [X3p3], 
POWER(samples.X4, 2) AS [X4p2],
POWER(samples.X4, 3) AS [X4p3],
POWER(samples.Y1, 3) AS [Y1p3], 
POWER(samples.Y2, 3) AS [Y2p3], 
POWER(samples.Y3, 3) AS [Y3p3], 
POWER(samples.Y4, 3) AS [Y4p3]

Overall, great answer by @Aprillion! Well explained and the numberempire.com h/t was very helpful.

like image 43
rhp997 Avatar answered Sep 21 '22 13:09

rhp997