I'm trying to perform what are many iterations of Scipy's curve_fit
at once in order to avoid loops and therefore increase speed.
This is very similar to this problem, which was solved. However, the fact that the functions are piece-wise (discontinuous) makes so that that solution isn't applicable here.
Consider this example:
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
rng.seed(0)
N=20
X=np.logspace(-1,1,N)
Y = np.zeros((4, N))
for i in range(0,4):
b = i+1
a = b
print(a,b)
Y[i] = (X/b)**(-a) #+ 0.01 * rng.randn(6)
Y[i, X>b] = 1
This yields these arrays:
Which as you can see are discontinuous at X==b
. I can retrieve the original values of a
and b
by using curve_fit
iteratively:
def plaw(r, a, b):
""" Theoretical power law for the shape of the normalized conditional density """
import numpy as np
return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])
coeffs=[]
for ix in range(Y.shape[0]):
print(ix)
c0, pcov = curve_fit(plaw, X, Y[ix])
coeffs.append(c0)
But this process can be very slow depending of the size of X
, Y
and the loop, so I'm trying to speed things up by trying to get coeffs
without the need for a loop. So far I haven't had any luck.
Things that might be important:
X
and Y
only contain positive valuesa
and b
are always positiveEDIT
This is as far as I've gotten:
y=np.ma.masked_where(Y<1.01, Y)
lX = np.log(X)
lY = np.log(y)
A = np.vstack([lX, np.ones(len(lX))]).T
m,c=np.linalg.lstsq(A, lY.T)[0]
print('a=',-m)
print('b=',np.exp(-c/m))
But even without any noise the output is:
a= [0.18978965578339158 1.1353633705997466 2.220234483915197 3.3324502660995714]
b= [339.4090881838179 7.95073481873057 6.296592007396107 6.402567167503574]
Which is way worse than I was hoping to get.
Here are three approaches to speeding this up. You gave no desired speed up or accuracies, or even vector sizes, so buyer beware.
Timings:
len 1 2 3 4
1000 0.045 0.033 0.025 0.022
10000 0.290 0.097 0.029 0.023
100000 3.429 0.767 0.083 0.030
1000000 0.546 0.046
1) Original Method
2) Pre-estimate with Subset
3) M Newville [linear log-log estimate](https://stackoverflow.com/a/44975066/7311767)
4) Subset Estimate (Use Less Data)
A decent speedup can be achieved by simply running the curve_fit
twice, where the first time uses a short subset of the data to get a quick estimate. That estimate is then used to seed a curve_fit
with the entire dataset.
x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
return curve_fit(power_law, x, y, p0=c0)[0]
Using the log estimate proposed by M Newville, is also considerably faster. As the OP was concerned about the initial estimate method proposed by Newville, this method uses curve_fit
with a subset to provide the estimate of the break point in the curve.
x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
index_max = np.where(x > c0[1])[0][0]
log_x = np.log(x[:index_max])
log_y = np.log(y[:index_max])
result = linregress(log_x, log_y)
return -result[0], np.exp(-result[1] / result[0])
return (m, c), result
Finally the seed mechanism used for the previous two methods provides pretty good estimates on the sample data. Of course it is sample data so your mileage may vary.
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
from scipy.stats import linregress
fit_data = {}
current_data = None
def data_for_fit(a, b, n):
key = a, b, n
if key not in fit_data:
rng.seed(0)
x = np.logspace(-1, 1, n)
y = np.clip((x / b) ** (-a) + 0.01 * rng.randn(n), 0.001, None)
y[x > b] = 1
fit_data[key] = x, y
return fit_data[key]
def power_law(r, a, b):
""" Power law for the shape of the normalized conditional density """
import numpy as np
return np.piecewise(
r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])
def method1():
x, y = current_data
return curve_fit(power_law, x, y)[0]
def method2():
x, y = current_data
return curve_fit(power_law, x, y, p0=method4()[0])
def method3():
x, y = current_data
c0, pcov = method4()
index_max = np.where(x > c0[1])[0][0]
log_x = np.log(x[:index_max])
log_y = np.log(y[:index_max])
result = linregress(log_x, log_y)
m, c = -result[0], np.exp(-result[1] / result[0])
return (m, c), result
def method4():
x, y = current_data
stride = int(max(1, len(x) / 200))
return curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])
from timeit import timeit
def runit(stmt):
print("%s: %.3f %s" % (
stmt, timeit(stmt + '()', number=10,
setup='from __main__ import ' + stmt),
eval(stmt + '()')[0]
))
def runit_size(size):
print('Length: %d' % size)
if size <= 100000:
runit('method1')
runit('method2')
runit('method3')
runit('method4')
for i in (1000, 10000, 100000, 1000000):
current_data = data_for_fit(3, 3, i)
runit_size(i)
Two suggestions:
numpy.where
(and possibly argmin
) to find the X
value at which the Y
data becomes 1, or perhaps just slightly larger than 1, and truncate the data to that point -- effectively ignoring the data where Y=1. That might be something like:
index_max = numpy.where(y < 1.2)[0][0]
x = y[:index_max]
y = y[:index_max]
curve_fit
, but can use scipy.stats.linregress
on log(Y)
vs log(Y)
. For your real work, that will at the very least give good starting values for a subsequent fit.Following up on this and trying to follow your question, you might try something like:
import numpy as np
from scipy.stats import linregress
np.random.seed(0)
npts = 51
x = np.logspace(-2, 2, npts)
YTHRESH = 1.02
for i in range(5):
b = i + 1.0 + np.random.normal(scale=0.1)
a = b + np.random.random()
y = (x/b)**(-a) + np.random.normal(scale=0.0030, size=npts)
y[x>b] = 1.0
# to model exponential decay, first remove the values
# where y ~= 1 where the data is known to not decay...
imax = np.where(y < YTHRESH)[0][0]
# take log of this truncated x and y
_x = np.log(x[:imax])
_y = np.log(y[:imax])
# use linear regression on the log-log data:
out = linregress(_x, _y)
# map slope/intercept to scale, exponent
afit = -out.slope
bfit = np.exp(out.intercept/afit)
print(""" === Fit Example {i:3d}
a expected {a:4f}, got {afit:4f}
b expected {b:4f}, got {bfit:4f}
""".format(i=i+1, a=a, b=b, afit=afit, bfit=bfit))
Hopefully that's enough to get you going.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With