I have a bunch of data, generally in the form a, b, c, ..., y
where y = f(a, b, c...)
Most of them are three and four variables, and have 10k - 10M records. My general assumption is that they are algebraic in nature, something like:
y = P1 a^E1 + P2 b^E2 + P3 c^E3
Unfortunately, my last statistical analysis class was 20 years ago. What is the easiest way to get a good approximation of f? Open source tools with a very minimal learning curve (i.e. something where I could get a decent approximation in an hour or so) would be ideal. Thanks!
Curve Fitting using Polynomial Terms in Linear Regression Despite its name, you can fit curves using linear regression. The most common method is to include polynomial terms in the linear model. Polynomial terms are independent variables that you raise to a power, such as squared or cubed terms.
For above example, x = v and y = p. The process of finding the equation of the curve of best fit, which may be most suitable for predicting the unknown values, is known as curve fitting. Therefore, curve fitting means an exact relationship between two variables by algebraic equations.
The highest-order polynomial that Trendline can use as a fitting function is a regular polynomial of order six, i.e., y = ax6 + bx5 +cx4 + ak3 + ex2 +fx + g. polynomials such as y = ax2 + bx3'2 + cx + + e.
In case it's useful, here's a Numpy/Scipy (Python) template to do what you want:
from numpy import array
from scipy.optimize import leastsq
def __residual(params, y, a, b, c):
p0, e0, p1, e1, p2, e2 = params
return p0 * a ** e0 + p1 * b ** e1 + p2 * c ** e2 - y
# load a, b, c
# guess initial values for p0, e0, p1, e1, p2, e2
p_opt = leastsq(__residual, array([p0, e0, p1, e1, p2, e2]), args=(y, a, b, c))
print 'y = %f a^%f + %f b^%f %f c^%f' % map(float, p_opt)
If you really want to understand what's going on, though, you're going to have to invest the time to scale the learning curve for some tool or programming environment - I really don't think there's any way around that. People don't generally write specialized tools for doing things like 3-term power regressions exclusively.
The basics of data fitting involve assuming a general form of a solution, guessing some initial values for constants, and then iterating to minimize the error of the guessed solution to find a specific solution, usually in the least-squares sense.
Look into R or Octave for open source tools. They are both capable of least-squares analysis, with several tutorials just a Google search away.
Edit: Octave code for estimating the coefficients for a 2nd order polynomial
x = 0:0.1:10;
y = 5.*x.^2 + 4.*x + 3;
% Add noise to y data
y = y + randn(size(y))*0.1;
% Estimate coefficients of polynomial
p = polyfit(x,y,2)
On my machine, I get:
ans =
5.0886 3.9050 2.9577
I spent over a week trying to do essentially the same thing. I tried a whole bunch of optimization stuff to fine tune the coefficients with basically no success, then I found out that there is a closed form solution and it works really well.
Disclaimer: I was trying to fit data with a fixed maximum order of magnitude. If there is no limit to your E1, E2, etc values, then this won't work for you.
Now that I've taken the time to learn this stuff, I actually see that some of the answers would have given good hints if I understood them. It had also been a while since my last statistics and linear algebra class.
So if there are other people out there who are lacking the linear algebra knowledge, here's what I did.
Even though this is not a linear function you are trying to fit, it turns out that this is still a linear regression problem. Wikipedia has a really good article on linear regression. I recommend reading it slowly: https://en.wikipedia.org/wiki/Linear_regression#:~:text=In%20statistics%2C%20linear%20regression%20is,as%20dependent%20and%20independent%20variables). It also links a lot of other good related articles.
If you don't know how to do a simple (single variable) linear regression problem using matrices, take some time to learn how to do that.
Once you learn how to do simple linear regression, then try some multivariable linear regression. Basically, to do multi variable linear regression, you create an X matrix where there is a row for each of your input data items and each row contains all of the variable values for that data entry (plus a 1 in the last column which is used for the constant value at the end of your polynomial (called an intercept)). Then you create a Y matrix that is a single column with a row for each data item. Then you solve B = (XTX)-1XTY. B then becomes all of the coefficients for your polynomial.
For multi-variable polynomial regression, its the same idea, just now you have a huge multi-variable linear regression where each regressor (variable you're doing regression on) is a coefficient for your giant polynomial expression.
So if your input data looks like this:
Inputs | Output |
---|---|
a1, b1, | y1 |
a2, b2, | y2 |
... | ... |
aN, bN, | yN |
And you want to fit a 2nd order polynomial of the form y = c1a^2b^2 + c2a^2b + c3a^2 + c4ab^2 + c5ab + c6a + c7b^2 + c8b + c9, then your X matrix will look like:
a1^2*b1^2 | a1^2*b1 | a1^2 | a1*b1^2 | a1*b1 | a1 | b1^2 | b1 | 1 |
a2^2*b2^2 | a2^2*b2 | a2^2 | a2*b1^2 | a2*b2 | a2 | b2^2 | b2 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
aN^2*bN^2 | aN^2*bN | aN^2 | aN*bN^2 | aN*bN | aN | bN^2 | bN | 1 |
Your Y matrix is simply:
y1 |
y2 |
... |
yN |
Then you do B = (XTX)-1XTY and then B will equal
c1 |
c2 |
c3 |
c4 |
c5 |
c6 |
c7 |
c8 |
c9 |
Note that the total number of coefficients will be (o + 1)V where o is the order of the polynomial and V is the number of variables, so it grows pretty quickly.
If you are using good matrix code, then I believe the runtime complexity will be O(((o+1)V)3 + ((o + 1)V)2N), where V is the number of variables, o is the order of the polynomial, and N is the number of data inputs you have. Initially this sounds pretty terrible, but in most cases, o and V are probably not going to be high, so then this just becomes linear with respect to N.
Note that if you are writing your own matrix code, then it is important to make sure that your inversion code uses something like an LU decomposition. If you use a naïve inversion approach (like I did at first) then that ((o+1)V)3 becomes ((o+1)V)!, which was way worse. Before I made that change, I predict that my 5th order 3 variable polynomial would take roughly 400 google millennia to complete. After using LU decomposition, it takes about 7 seconds.
This approach requires that (XTX) not be a singular matrix (in other words, it can be inverted). My linear algebra is a little rough so I don't know all of the cases where that would occur, but I know that it occurs when there is perfect multi-collinearity between input variables. That means one variable is just another factor multiplied by a constant (for example, one input is number of hours to complete a project and another is dollars to complete a project, but the dollars are just based on an hourly rate times the number of hours).
The good news is that when there is perfect multi-collinearity, you'll know. You'll end up with a divide by zero or something when you are inverting the matrix.
The bigger problem is when you have imperfect multi-collinearity. That's when you have two closely related but not perfectly related variables (such as temperature and altitude, or speed and mach number). In those cases, this approach still works in theory, but it becomes so sensitive that small floating point errors can cause the result to be WAY off.
In my observations, however, the result is either really good or really bad, so you could just set some threshold on your mean squared error and if its over that, then say "couldn't compute a polynomial".
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With