Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can scipy.stats identify and mask obvious outliers?

Tags:

With scipy.stats.linregress I am performing a simple linear regression on some sets of highly correlated x,y experimental data, and initially visually inspecting each x,y scatter plot for outliers. More generally (i.e. programmatically) is there a way to identify and mask outliers?

like image 305
a different ben Avatar asked Apr 19 '12 15:04

a different ben


People also ask

What is the best way to identify outliers in data?

The simplest way to detect an outlier is by graphing the features or the data points. Visualization is one of the best and easiest ways to have an inference about the overall data and the outliers. Scatter plots and box plots are the most preferred visualization tools to detect outliers.

How do you visualize an outlier in Python?

Find outliers in data using a box plot Begin by creating a box plot for the fare_amount column. A box plot allows us to identify the univariate outliers, or outliers for one variable. Box plots are useful because they show minimum and maximum values, the median, and the interquartile range of the data.

Which tool is used to help us identify outliers in a data set?

A scatter plot is useful to find outliers in bivariate data (data with two variables). You can easily spot the outliers because they will be far away from the majority of points on the scatter plot.

What is used to identify outliers?

Interquartile Range (IQR) is important because it is used to define the outliers. It is the difference between the third quartile and the first quartile (IQR = Q3 -Q1).


1 Answers

The statsmodels package has what you need. Look at this little code snippet and its output:

# Imports # import statsmodels.api as smapi import statsmodels.graphics as smgraphics # Make data # x = range(30) y = [y*10 for y in x] # Add outlier # x.insert(6,15) y.insert(6,220) # Make graph # regression = smapi.OLS(x, y).fit() figure = smgraphics.regressionplots.plot_fit(regression, 0) # Find outliers # test = regression.outlier_test() outliers = ((x[i],y[i]) for i,t in enumerate(test) if t[2] < 0.5) print 'Outliers: ', list(outliers) 

Example figure 1

Outliers: [(15, 220)]

Edit

With the newer version of statsmodels, things have changed a bit. Here is a new code snippet that shows the same type of outlier detection.

# Imports # from random import random import statsmodels.api as smapi from statsmodels.formula.api import ols import statsmodels.graphics as smgraphics # Make data # x = range(30) y = [y*(10+random())+200 for y in x] # Add outlier # x.insert(6,15) y.insert(6,220) # Make fit # regression = ols("data ~ x", data=dict(data=y, x=x)).fit() # Find outliers # test = regression.outlier_test() outliers = ((x[i],y[i]) for i,t in enumerate(test.icol(2)) if t < 0.5) print 'Outliers: ', list(outliers) # Figure # figure = smgraphics.regressionplots.plot_fit(regression, 1) # Add line # smgraphics.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0]) 

Example figure 2

Outliers: [(15, 220)]

like image 110
xApple Avatar answered Sep 17 '22 09:09

xApple