Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

scipy.interpolate.UnivariateSpline not smoothing regardless of parameters

I'm having trouble getting scipy.interpolate.UnivariateSpline to use any smoothing when interpolating. Based on the function's page as well as some previous posts, I believe it should provide smoothing with the s parameter.

Here is my code:

# Imports
import scipy
import pylab

# Set up and plot actual data
x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]
pylab.plot(x, y, "o", label="Actual")

# Plot estimates using splines with a range of degrees
for k in range(1, 4):
    mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=k, s=2)
    xi = range(0, 15100, 20)
    yi = mySpline(xi)
    pylab.plot(xi, yi, label="Predicted k=%d" % k)

# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()

Here is the result:

Splines without smoothing

I have tried this with a range of s values (0.01, 0.1, 1, 2, 5, 50), as well as explicit weights, set to either the same thing (1.0) or randomized. I still can't get any smoothing, and the number of knots is always the same as the number of data points. In particular, I'm looking for outliers like that 4th point (7990.4664106277542, 5851.6866463790966) to be smoothed over.

Is it because I don't have enough data? If so, is there a similar spline function or cluster technique I can apply to achieve smoothing with this few datapoints?

like image 782
BigChief Avatar asked Dec 07 '22 17:12

BigChief


2 Answers

Short answer: you need to choose the value for s more carefully.

The documentation for UnivariateSpline states that:

Positive smoothing factor used to choose the number of knots. Number of 
knots will be increased until the     smoothing condition is satisfied:
sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s

From this one can deduce that "reasonable" values for smoothing, if you don't pass in explicit weights, are around s = m * v where m is the number of data points and v the variance of the data. In this case, s_good ~ 5e7.

EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.

like image 96
pv. Avatar answered Jan 17 '23 15:01

pv.


@Zhenya's answer of manually setting knots in between datapoints was too rough to deliver good results in noisy data without being selective about how this technique is applied. However, inspired by his/her suggestion, I have had success with Mean-Shift clustering from the scikit-learn package. It performs auto-determination of the cluster count and seems to do a fairly good smoothing job (very smooth in fact).

# Imports
import numpy
import pylab
import scipy
import sklearn.cluster

# Set up original data - note that it's monotonically increasing by X value!
data = {}
data['original'] = {}
data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193]
data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598]

# Cluster data, sort it and and save
inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))])
meanShift = sklearn.cluster.MeanShift()
meanShift.fit(inputNumpy)
clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_]
clusteredData.sort(lambda pair1, pair2: cmp(pair1[0],pair2[0]))
data['clustered'] = {}
data['clustered']['x'] = [pair[0] for pair in clusteredData]
data['clustered']['y'] = [pair[1] for pair in clusteredData]

# Build a spline using the clustered data and predict
mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1)
xi = range(0, round(max(data['original']['x']), -3) + 3000, 20)
yi = mySpline(xi)

# Plot the datapoints
pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered')
pylab.plot(xi, yi, label="Predicted (%s)" %  'clustered')
pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original')

# Show the plot
pylab.grid(True)
pylab.xticks(rotation=45)
pylab.legend( loc="lower right" )
pylab.show()

enter image description here

like image 30
BigChief Avatar answered Jan 17 '23 15:01

BigChief