With scipy.stats.linregress I am performing a simple linear regression on some sets of highly correlated x,y experimental data, and initially visually inspecting each x,y scatter plot for outliers. More generally (i.e. programmatically) is there a way to identify and mask outliers?
The simplest way to detect an outlier is by graphing the features or the data points. Visualization is one of the best and easiest ways to have an inference about the overall data and the outliers. Scatter plots and box plots are the most preferred visualization tools to detect outliers.
Find outliers in data using a box plot Begin by creating a box plot for the fare_amount column. A box plot allows us to identify the univariate outliers, or outliers for one variable. Box plots are useful because they show minimum and maximum values, the median, and the interquartile range of the data.
A scatter plot is useful to find outliers in bivariate data (data with two variables). You can easily spot the outliers because they will be far away from the majority of points on the scatter plot.
Interquartile Range (IQR) is important because it is used to define the outliers. It is the difference between the third quartile and the first quartile (IQR = Q3 -Q1).
The statsmodels
package has what you need. Look at this little code snippet and its output:
# Imports # import statsmodels.api as smapi import statsmodels.graphics as smgraphics # Make data # x = range(30) y = [y*10 for y in x] # Add outlier # x.insert(6,15) y.insert(6,220) # Make graph # regression = smapi.OLS(x, y).fit() figure = smgraphics.regressionplots.plot_fit(regression, 0) # Find outliers # test = regression.outlier_test() outliers = ((x[i],y[i]) for i,t in enumerate(test) if t[2] < 0.5) print 'Outliers: ', list(outliers)
Outliers: [(15, 220)]
With the newer version of statsmodels
, things have changed a bit. Here is a new code snippet that shows the same type of outlier detection.
# Imports # from random import random import statsmodels.api as smapi from statsmodels.formula.api import ols import statsmodels.graphics as smgraphics # Make data # x = range(30) y = [y*(10+random())+200 for y in x] # Add outlier # x.insert(6,15) y.insert(6,220) # Make fit # regression = ols("data ~ x", data=dict(data=y, x=x)).fit() # Find outliers # test = regression.outlier_test() outliers = ((x[i],y[i]) for i,t in enumerate(test.icol(2)) if t < 0.5) print 'Outliers: ', list(outliers) # Figure # figure = smgraphics.regressionplots.plot_fit(regression, 1) # Add line # smgraphics.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])
Outliers: [(15, 220)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With