
matplotlib: disregard outliers when plotting

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.

With matplotlib, I plot against the range [0, max_data_value]

How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?

Should I simply take the 95th percentile and use the range [0, 95_percentile] on the x-axis?

Ricky Robinson asked Aug 09 '12 11:08




2 Answers

There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").

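Many outlier tests use the median absolute deviation (MAD) rather than the 95th percentile or a variance-based measurement such as the standard deviation. The reason is that the variance/stddev is itself computed from every point, so it gets heavily skewed by the very outliers you are trying to detect, whereas the median is barely affected by them.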
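To make the point about skew concrete, here is a minimal sketch comparing the standard deviation with the median absolute deviation. The data values are made up to resemble the question (values around 1e-4 plus one outlier at 0.1); they are an assumption, not real measurements.

```python
import numpy as np

# Hypothetical data: values near 1e-4 plus a single outlier at 0.1
data = np.array([1.0e-4, 1.2e-4, 0.9e-4, 1.1e-4, 1.0e-4, 0.1])

# The standard deviation uses every point, so the outlier dominates it
std = data.std()

# The median absolute deviation only depends on the "middle" of the data
median = np.median(data)
mad = np.median(np.abs(data - median))

print(f"std = {std:.3g}")
print(f"MAD = {mad:.3g}")
```

The standard deviation ends up orders of magnitude larger than the typical spread of the "good" points, while the MAD stays on the same scale as that spread.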

Here's a function that implements one of the more common outlier tests.

import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : A numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    # Treat a 1-D array as numobservations rows with a single column
    if len(points.shape) == 1:
        points = points[:,None]
    # Euclidean distance of each observation from the per-column median
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    # Note: if more than half the points are identical, med_abs_deviation
    # is 0 and this division produces inf/nan values
    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
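For reference, the quantity the function computes can be written out as follows for the 1-D case. The notation ($M_i$ for the modified z-score, $\tilde{x}$ for the median) is chosen here for illustration; for multi-dimensional input the function uses the Euclidean distance from the per-column median in place of the absolute difference.

```latex
\[
M_i = \frac{0.6745\,\lvert x_i - \tilde{x} \rvert}{\mathrm{MAD}},
\qquad
\mathrm{MAD} = \operatorname{median}_i \,\lvert x_i - \tilde{x} \rvert
\]
```

The constant 0.6745 is approximately the 0.75 quantile of the standard normal distribution, so for normally distributed data $M_i$ is on roughly the same scale as an ordinary z-score. A point is flagged when $M_i$ exceeds `thresh` (3.5 by default in the function).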

As an example of using it, you'd do something like the following:

import numpy as np
import matplotlib.pyplot as plt

# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier

# Generate some data
x = np.random.random(100)

# Append a few "bad" points
x = np.r_[x, -3, -10, 100]

# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]

# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)

ax1.hist(x)
ax1.set_title('Original')

ax2.hist(filtered)
ax2.set_title('Without Outliers')

plt.show()

[Image: two stacked histograms, titled 'Original' and 'Without Outliers']

Joe Kington answered Sep 23 '22 05:09


If you aren't fussed about rejecting outliers as described by Joe, and you only want this for aesthetic reasons, you could just set your plot's x-axis limits:

plt.xlim(min_x_data_value,max_x_data_value) 

Where the values are your desired limits to display.

plt.ylim(min, max) sets the limits on the y axis in the same way.
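As a rough sketch of how the limits could be chosen automatically, the percentile idea from the question can be combined with the axis limits. The data, the 95th-percentile cut-off, and the bin count below are all assumptions for illustration; nothing here removes points from the data, it only changes what is displayed.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 100 small values plus one outlier at 0.1
rng = np.random.default_rng(0)
data = np.r_[rng.random(100) * 1e-4, 0.1]

# Upper display limit taken from the 95th percentile (an arbitrary choice)
upper = np.percentile(data, 95)

fig, ax = plt.subplots()
# Passing range= makes the bins span only the displayed interval;
# otherwise the default bins would stretch all the way to the outlier
ax.hist(data, bins=50, range=(0, upper))
ax.set_xlim(0, upper)

# plt.show()  # uncomment to display the figure interactively
```

Note that a fixed percentile always hides the same fraction of the data, whether or not those points are really outliers, which is why the MAD-based test above may be the better choice when the decision matters.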

Jdog answered Sep 23 '22 05:09