


Simple multidimensional curve fitting

Tags:

statistics

regression

best-fit-curve

I have a bunch of data, generally in the form a, b, c, ..., y

where y = f(a, b, c...)

Most of them are three and four variables, and have 10k - 10M records. My general assumption is that they are algebraic in nature, something like:

y = P1 a^E1 + P2 b^E2 + P3 c^E3

Unfortunately, my last statistical analysis class was 20 years ago. What is the easiest way to get a good approximation of f? Open source tools with a very minimal learning curve (i.e. something where I could get a decent approximation in an hour or so) would be ideal. Thanks!


asked Feb 09 '09 18:02

user64258

3 Answers

In case it's useful, here's a Numpy/Scipy (Python) template to do what you want:

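import numpy as np
from scipy.optimize import leastsq

def residual(params, y, a, b, c):
    # Difference between the model's prediction and the observed y values
    p0, e0, p1, e1, p2, e2 = params
    return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2 - y

# load a, b, c, y as NumPy arrays of equal length
# guess initial values for p0, e0, p1, e1, p2, e2
guess = np.array([p0, e0, p1, e1, p2, e2])
# leastsq returns (solution, status flag), so unpack the solution before printing
p_opt, ier = leastsq(residual, guess, args=(y, a, b, c))
print('y = %f a^%f + %f b^%f + %f c^%f' % tuple(p_opt))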

If you really want to understand what's going on, though, you're going to have to invest the time to scale the learning curve for some tool or programming environment - I really don't think there's any way around that. People don't generally write specialized tools for doing things like 3-term power regressions exclusively.
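
To try the template end to end without real data, here is a minimal sketch that runs on synthetic inputs. The data, the "true" parameters, and the starting guess are all made-up values for illustration, and the `rms` helper is not part of SciPy. A nonlinear fit like this depends on the starting guess, so treat it as a starting point rather than a guaranteed recipe.

```python
import numpy as np
from scipy.optimize import leastsq

def residual(params, y, a, b, c):
    """Model prediction minus observed y for y = p0*a^e0 + p1*b^e1 + p2*c^e2."""
    p0, e0, p1, e1, p2, e2 = params
    return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2 - y

# Synthetic stand-in for real data (assumed values, for illustration only).
rng = np.random.default_rng(0)
a, b, c = rng.uniform(1.0, 5.0, size=(3, 2000))
true_params = np.array([2.0, 1.5, 0.5, 3.0, 4.0, 0.5])
y = residual(true_params, 0.0, a, b, c)  # with y = 0 the residual is just the model value

# Starting guess (assumed); a poor guess can lead to a poor local minimum.
guess = np.array([1.5, 1.2, 1.0, 2.5, 3.0, 0.8])
p_opt, ier = leastsq(residual, guess, args=(y, a, b, c))

def rms(params):
    """Root-mean-square error of the model for the given parameters."""
    return float(np.sqrt(np.mean(residual(params, y, a, b, c) ** 2)))

rms_before = rms(guess)
rms_after = rms(p_opt)
print('RMS error before: %g, after: %g' % (rms_before, rms_after))
print('y = %f a^%f + %f b^%f + %f c^%f' % tuple(p_opt))
```

Comparing the error before and after the fit is a quick sanity check that the optimizer actually improved on the guess.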

answered Oct 13 '22 13:10

David Z

The basics of data fitting involve assuming a general form of a solution, guessing some initial values for constants, and then iterating to minimize the error of the guessed solution to find a specific solution, usually in the least-squares sense.

Look into R or Octave for open source tools. They are both capable of least-squares analysis, with several tutorials just a Google search away.

Edit: Octave code for estimating the coefficients for a 2nd order polynomial

x = 0:0.1:10;
y = 5.*x.^2 + 4.*x + 3;

% Add noise to y data
y = y + randn(size(y))*0.1;

% Estimate coefficients of polynomial
p = polyfit(x,y,2)

On my machine, I get:

ans =

   5.0886   3.9050   2.9577
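
For anyone who prefers Python, roughly the same experiment can be sketched with NumPy's `polyfit`. The seed and the use of `linspace` for the grid are my own choices here, so the exact numbers will differ from the Octave run above.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable

x = np.linspace(0.0, 10.0, 101)          # same grid as 0:0.1:10
y = 5.0 * x**2 + 4.0 * x + 3.0

# Add noise to y data
y = y + rng.normal(scale=0.1, size=y.shape)

# Estimate coefficients of the polynomial, highest power first
p = np.polyfit(x, y, 2)
print(p)
```

The three values printed should land close to 5, 4 and 3.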

answered Oct 13 '22 15:10

Scottie T

I spent over a week trying to do essentially the same thing. I tried a whole bunch of optimization stuff to fine tune the coefficients with basically no success, then I found out that there is a closed form solution and it works really well.

Disclaimer: I was trying to fit data with a polynomial of fixed maximum order. If there is no limit to your E1, E2, etc. values, then this won't work for you.

Now that I've taken the time to learn this stuff, I actually see that some of the answers would have given good hints if I understood them. It had also been a while since my last statistics and linear algebra class.

So if there are other people out there who are lacking the linear algebra knowledge, here's what I did.

Even though the function you are trying to fit is not linear in its inputs, it turns out that this is still a linear regression problem, because the function is linear in the coefficients you are solving for. Wikipedia has a really good article on linear regression. I recommend reading it slowly: https://en.wikipedia.org/wiki/Linear_regression. It also links a lot of other good related articles.

If you don't know how to do a simple (single variable) linear regression problem using matrices, take some time to learn how to do that.

Once you learn how to do simple linear regression, then try some multivariable linear regression. Basically, to do multivariable linear regression, you create an X matrix with one row for each of your input data items, where each row contains all of the variable values for that data entry, plus a 1 in the last column, which is used for the constant value at the end of your polynomial (called the intercept). Then you create a Y matrix that is a single column with a row for each data item. Then you solve B = (X^T X)^-1 X^T Y. B then holds all of the coefficients for your polynomial.
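
As a rough illustration of that recipe, here is a small NumPy sketch. The function name `fit_linear` and the synthetic data are made up for this example. It solves the system (X^T X) B = X^T Y with `np.linalg.solve` rather than forming the inverse explicitly, which gives the same B.

```python
import numpy as np

def fit_linear(inputs, y):
    """Least-squares fit of y = inputs @ coeffs + intercept via the normal equations."""
    n = inputs.shape[0]
    # One row per data item; a column of 1s goes last for the intercept.
    X = np.column_stack([inputs, np.ones(n)])
    # Solve (X^T X) B = X^T Y instead of computing the inverse directly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (assumed values): three input variables, known coefficients.
rng = np.random.default_rng(1)
inputs = rng.uniform(-1.0, 1.0, size=(500, 3))
y = inputs @ np.array([2.0, -1.0, 0.5]) + 4.0

B = fit_linear(inputs, y)
print(B)  # three slopes followed by the intercept
```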

For multivariable polynomial regression, it's the same idea, just now you have a huge multivariable linear regression where each regressor (variable you're doing regression on) is one term of your giant polynomial expression, and B gives you the coefficient for each term.

So if your input data looks like this:

Inputs	Output
a1, b1	y1
a2, b2	y2
...	...
aN, bN	yN

And you want to fit a 2nd order polynomial of the form y = c1*a^2*b^2 + c2*a^2*b + c3*a^2 + c4*a*b^2 + c5*a*b + c6*a + c7*b^2 + c8*b + c9, then your X matrix will look like:


a1^2*b1^2	a1^2*b1	a1^2	a1*b1^2	a1*b1	a1	b1^2	b1	1
a2^2*b2^2	a2^2*b2	a2^2	a2*b2^2	a2*b2	a2	b2^2	b2	1
...	...	...	...	...	...	...	...	...
aN^2*bN^2	aN^2*bN	aN^2	aN*bN^2	aN*bN	aN	bN^2	bN	1

Your Y matrix is simply:


y1
y2
...
yN

Then you do B = (X^T X)^-1 X^T Y and then B will equal:


c1
c2
c3
c4
c5
c6
c7
c8
c9
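
The two-variable case above can be sketched in NumPy as follows. The helper name `design_matrix_2var`, the random inputs, and the coefficient values are all assumptions for illustration; the columns follow the same order as the X matrix shown above.

```python
import numpy as np

def design_matrix_2var(a, b):
    """Columns in order: a^2 b^2, a^2 b, a^2, a b^2, a b, a, b^2, b, 1."""
    return np.column_stack([
        a**2 * b**2, a**2 * b, a**2,
        a * b**2,    a * b,    a,
        b**2,        b,        np.ones_like(a),
    ])

# Synthetic inputs and made-up coefficients c1..c9, for illustration only.
rng = np.random.default_rng(2)
a = rng.uniform(-2.0, 2.0, size=1000)
b = rng.uniform(-2.0, 2.0, size=1000)
c_true = np.array([0.5, -1.0, 2.0, 0.0, 3.0, -0.5, 1.5, 0.25, 7.0])

X = design_matrix_2var(a, b)
Y = X @ c_true                       # noise-free outputs built from the known coefficients

# B = (X^T X)^-1 X^T Y, computed by solving the linear system
B = np.linalg.solve(X.T @ X, X.T @ Y)
print(B)                             # should recover c1..c9
```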

Note that the total number of coefficients will be (o + 1)^V where o is the order of the polynomial and V is the number of variables, so it grows pretty quickly.
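
To see how fast that count grows, here is a hedged sketch that builds the X matrix for any number of variables by enumerating every combination of exponents. The function `design_matrix` is a hypothetical helper written for this example, not something from a library.

```python
import itertools
import numpy as np

def design_matrix(variables, order):
    """One column per term, with each variable raised to a power from order down to 0."""
    exponents = list(itertools.product(range(order, -1, -1), repeat=len(variables)))
    columns = [
        np.prod([v**e for v, e in zip(variables, exps)], axis=0)
        for exps in exponents
    ]
    return np.column_stack(columns), exponents

# Small synthetic inputs (assumed values) just to inspect the shapes.
rng = np.random.default_rng(3)
a, b, c = rng.uniform(0.5, 1.5, size=(3, 50))

X2, exps2 = design_matrix([a, b], order=2)      # 2 variables, 2nd order
X5, exps5 = design_matrix([a, b, c], order=5)   # 3 variables, 5th order
print(X2.shape, X5.shape)
```

The number of columns is the number of coefficients you will be solving for, so it is worth printing before starting a large fit.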

If you are using good matrix code, then I believe the runtime complexity will be O(((o+1)^V)³ + ((o + 1)^V)²N), where V is the number of variables, o is the order of the polynomial, and N is the number of data inputs you have. Initially this sounds pretty terrible, but in most cases, o and V are probably not going to be high, so then this just becomes linear with respect to N.

Note that if you are writing your own matrix code, then it is important to make sure that your inversion code uses something like an LU decomposition. If you use a naïve inversion approach (like I did at first), then that ((o+1)^V)³ becomes ((o+1)^V)!, which is way worse. Before I made that change, I estimated that my 5th order, 3 variable polynomial would take roughly 400 googol millennia to complete. After switching to LU decomposition, it takes about 7 seconds.
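
If you are not writing your own matrix code, an existing LU routine does the job. This is a minimal sketch using SciPy's `lu_factor` and `lu_solve`; the wrapper name `solve_normal_equations_lu` and the test data are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_normal_equations_lu(X, Y):
    """Solve (X^T X) B = X^T Y using an LU factorization of X^T X."""
    A = X.T @ X
    rhs = X.T @ Y
    lu, piv = lu_factor(A)           # factor once, cubic in the number of columns
    return lu_solve((lu, piv), rhs)  # cheap triangular solves

# Synthetic data (assumed values): four inputs plus an intercept column.
rng = np.random.default_rng(4)
X = np.column_stack([rng.uniform(-1.0, 1.0, size=(200, 4)), np.ones(200)])
b_true = np.array([1.0, -2.0, 3.0, 0.5, 4.0])
Y = X @ b_true

B = solve_normal_equations_lu(X, Y)
print(B)
```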

Another disclaimer

This approach requires that (X^T X) not be a singular matrix (in other words, it can be inverted). My linear algebra is a little rough so I don't know all of the cases where that would occur, but I know that it occurs when there is perfect multi-collinearity between input variables. That means one variable is just another variable multiplied by a constant (for example, one input is number of hours to complete a project and another is dollars to complete a project, but the dollars are just based on an hourly rate times the number of hours).

The good news is that when there is perfect multi-collinearity, you'll know. You'll end up with a divide by zero or something when you are inverting the matrix.
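
With floating point numbers the inversion does not always fail loudly, so one hedged alternative is to check the rank of X before solving. The sketch below uses NumPy's `matrix_rank`; the helper name and the hours/dollars data (with an assumed rate of 50 per hour) are made up to mirror the example above.

```python
import numpy as np

def has_perfect_collinearity(X):
    """True when the columns of X are linearly dependent (up to rounding)."""
    return bool(np.linalg.matrix_rank(X) < X.shape[1])

# Synthetic data (assumed values), mirroring the hours/dollars example.
rng = np.random.default_rng(5)
hours = rng.uniform(1.0, 100.0, size=200)
dollars = 50.0 * hours                      # exactly proportional to hours
other = rng.uniform(0.0, 1.0, size=200)     # an unrelated input

X_bad = np.column_stack([hours, dollars, np.ones(200)])
X_ok = np.column_stack([hours, other, np.ones(200)])

print(has_perfect_collinearity(X_bad), has_perfect_collinearity(X_ok))
```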

The bigger problem is when you have imperfect multi-collinearity. That's when you have two closely related but not perfectly related variables (such as temperature and altitude, or speed and Mach number). In those cases, this approach still works in theory, but it becomes so sensitive that small floating point errors can cause the result to be WAY off.
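
One way to spot this situation ahead of time is to look at the condition number of X^T X: the larger it is, the more rounding errors get amplified. This is only a sketch with made-up speed and Mach number data (the divisor 340 and the noise level are assumptions), using NumPy's `cond`.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
speed = rng.uniform(100.0, 300.0, size=n)
# Nearly, but not exactly, proportional to speed (assumed relationship).
mach = speed / 340.0 + rng.normal(scale=1e-4, size=n)
unrelated = rng.uniform(0.0, 1.0, size=n)

X_near = np.column_stack([speed, mach, np.ones(n)])       # closely related inputs
X_good = np.column_stack([speed, unrelated, np.ones(n)])  # unrelated inputs

cond_near = np.linalg.cond(X_near.T @ X_near)
cond_good = np.linalg.cond(X_good.T @ X_good)
print('near-collinear: %.3g, unrelated: %.3g' % (cond_near, cond_good))
```

A condition number many orders of magnitude larger than for unrelated inputs is a hint that the coefficients should not be trusted without further checks.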

In my observations, however, the result is either really good or really bad, so you could just set some threshold on your mean squared error and if it's over that, then say "couldn't compute a polynomial".
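
That threshold idea could look something like the sketch below. The function name `fit_or_reject`, the threshold values, and the data are assumptions for illustration. The rejected case here is simply data the model cannot explain, which exercises the same check as a numerically bad fit would.

```python
import numpy as np

def fit_or_reject(X, Y, max_mse):
    """Return the coefficients, or None when the fit fails or its MSE is too high."""
    try:
        B = np.linalg.solve(X.T @ X, X.T @ Y)
    except np.linalg.LinAlgError:
        return None                      # X^T X was exactly singular
    mse = float(np.mean((X @ B - Y) ** 2))
    # A NaN MSE compares as False here, so it is rejected too.
    return B if mse <= max_mse else None

# Synthetic data (assumed values).
rng = np.random.default_rng(7)
X = np.column_stack([rng.uniform(-1.0, 1.0, size=(300, 2)), np.ones(300)])
b_true = np.array([2.0, -3.0, 1.0])

good = fit_or_reject(X, X @ b_true, max_mse=1e-6)                 # data the model explains
bad = fit_or_reject(X, rng.normal(size=300) * 10, max_mse=1e-6)   # pure noise
print(good, bad)
```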

answered Oct 13 '22 13:10

NateW
