Remove data points below a curve with python

Question

I need to compare some theoretical data with real data in python. The theoretical data comes from resolving an equation. To improve the comparative I would like to remove data points that fall far from the theoretical curve. I mean, I want to remove the points below and above red dashed lines in the figure (made with matplotlib). Data points and theoretical curves

Both the theoretical curves and the data points are arrays of different length.

I can try to remove the points in a roughly-eye way, for example: the first upper point can be detected using:

data2[(data2.redshift<0.4)&data2.dmodulus>1]
rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')])

But I would like to use a less roughly-eye way.

So, can anyone help me finding an easy way of removing the problematic points?

Thank you!

Yann · Accepted Answer

This might be overkill and is based on your comment

Both the theoretical curves and the data points are arrays of different length.

I would do the following:

Truncate the data set so that its x values lie within the max and min values of the theoretical set.
Interpolate the theoretical curve using scipy.interpolate.interp1d and the above truncated data x values. The reason for step (1) is to satisfy the constraints of interp1d.
Use numpy.where to find data y values that are out side the range of acceptable theory values.
DONT discard these values, as was suggested in comments and other answers. If you want for clarity, point them out by plotting the 'inliners' one color and the 'outliers' an other color.

Here's a script that is close to what you are looking for, I think. It hopefully will help you accomplish what you want:

import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt

# make up data
def makeUpData():
    '''Make many more data points (x,y,yerr) than theory (x,y),
    with theory yerr corresponding to a constant "sigma" in y, 
    about x,y value'''
    NX= 150
    dataX = (np.random.rand(NX)*1.1)**2
    dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX
    dataErr = np.random.rand(NX)*dataX*1.3
    theoryX = np.arange(0,1,0.1)
    theoryY = theoryX*theoryX*1.5
    theoryErr = 0.5
    return dataX,dataY,dataErr,theoryX,theoryY,theoryErr

def makeSameXrange(theoryX,dataX,dataY):
    '''
    Truncate the dataX and dataY ranges so that dataX min and max are with in
    the max and min of theoryX.
    '''
    minT,maxT = theoryX.min(),theoryX.max()
    goodIdxMax = np.where(dataX<maxT)
    goodIdxMin = np.where(dataX[goodIdxMax]>minT)
    return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin]

# take 'theory' and get values at every 'data' x point
def theoryYatDataX(theoryX,theoryY,dataX):
    '''For every dataX point, find interpolated thoeryY value. theoryx needed
    for interpolation.'''
    f = interpolate.interp1d(theoryX,theoryY)
    return f(dataX[np.where(dataX<np.max(theoryX))])

# collect valid points
def findInlierSet(dataX,dataY,interpTheoryY,thoeryErr):
    '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return
    valid indicies.'''
    withinUpper = np.where(dataY<(interpTheoryY+theoryErr))
    withinLower = np.where(dataY[withinUpper]
                    >(interpTheoryY[withinUpper]-theoryErr))
    return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower]

def findOutlierSet(dataX,dataY,interpTheoryY,thoeryErr):
    '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return
    valid indicies.'''
    withinUpper = np.where(dataY>(interpTheoryY+theoryErr))
    withinLower = np.where(dataY<(interpTheoryY-theoryErr))
    return (dataX[withinUpper],dataY[withinUpper],
            dataX[withinLower],dataY[withinLower])
if __name__ == "__main__":

    dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData()

    TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY)

    interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX)

    inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY,
                                    theoryErr)

    outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX,
                                                    TruncDataY,
                                                    interpTheoryY,
                                                    theoryErr)
    #print inlierIndex
    fig = plt.figure()
    ax = fig.add_subplot(211)

    ax.errorbar(dataX,dataY,dataErr,fmt='.',color='k')
    ax.plot(theoryX,theoryY,'r-')
    ax.plot(theoryX,theoryY+theoryErr,'r--')
    ax.plot(theoryX,theoryY-theoryErr,'r--')
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    ax = fig.add_subplot(212)

    ax.plot(inDataX,inDataY,'ko')
    ax.plot(outUpX,outUpY,'bo')
    ax.plot(outDownX,outDownY,'ro')
    ax.plot(theoryX,theoryY,'r-')
    ax.plot(theoryX,theoryY+theoryErr,'r--')
    ax.plot(theoryX,theoryY-theoryErr,'r--')
    ax.set_xlim(0,1.4)
    ax.set_ylim(-.5,3)
    fig.savefig('findInliers.png')

This figure is the result: enter image description here

Remove data points below a curve with python

Tags:

python

matplotlib

scipy

scientific-computing

Illa Rivero Losada

1 Answers

Yann

Recent Activity

Donate For Us

Remove data points below a curve with python

Tags:

python

matplotlib

scipy

scientific-computing

Illa Rivero Losada

1 Answers

Yann

Related questions

Recent Activity

Donate For Us