Making pyplot.hist() first and last bins include outliers

Tags: python, matplotlib, numpy

The pyplot.hist() documentation states that when a range is set for a histogram, "lower and upper outliers are ignored".

Is it possible to make the first and last bins of a histogram include all outliers without changing the width of the bin?

For example, let's say I want to look at the range 0-3 with 3 bins: 0-1, 1-2, 2-3 (let's ignore cases of exact equality for simplicity). I would like the first bin to include all values from minus infinity to 1, and the last bin to include all values from 2 to infinity. However, if I explicitly set these bins to span that range, they will be very wide. I would like them to have the same width. The behavior I am looking for is like the behavior of hist() in Matlab.

Obviously I can numpy.clip() the data and plot that, which will give me what I want. But I would like to know whether there is a built-in solution for this.
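For reference, the clipping workaround mentioned above can be sketched as follows. This is only an illustration: the helper name `clipped_counts` and the sample values are made up for the example, and it uses numpy.histogram directly so the counts can be inspected without drawing anything.

```python
import numpy as np

def clipped_counts(data, lower, upper, n_bins):
    """Histogram counts where out-of-range values land in the edge bins."""
    data = np.asarray(data, dtype=float)
    # Values below `lower` become `lower`, values above `upper` become `upper`,
    # so they are counted in the first and last bins respectively.
    clipped = np.clip(data, lower, upper)
    counts, edges = np.histogram(clipped, bins=n_bins, range=(lower, upper))
    return counts, edges

# Hypothetical sample: two outliers (-5 and 10) around the range 0-3.
data = np.array([-5.0, 0.5, 1.5, 1.7, 2.5, 10.0])
counts, edges = clipped_counts(data, 0, 3, 3)
print(counts)  # [2 2 2]
```

Passing the clipped array to plt.hist() with the same bins and range should draw bars with these counts, all of equal width.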

asked Apr 05 '13 15:04

Bitwise

2 Answers

I was also struggling with this, and didn't want to use .clip() because it could be misleading, so I wrote a little function (borrowing heavily from this) to indicate that the upper and lower bins contained outliers:

import matplotlib.pyplot as plt
import numpy as np

def outlier_aware_hist(data, lower=None, upper=None):
    """Plot a histogram whose edge bins also count values outside [lower, upper]."""
    data = np.asarray(data)

    # Compare against None explicitly so that a bound of 0 is not treated as unset.
    if lower is None or lower < data.min():
        lower = data.min()
        lower_outliers = False
    else:
        lower_outliers = True

    if upper is None or upper > data.max():
        upper = data.max()
        upper_outliers = False
    else:
        upper_outliers = True

    n, bins, patches = plt.hist(data, range=(lower, upper), bins='auto')

    if lower_outliers:
        # Add everything below the range to the first bar and highlight it.
        n_lower_outliers = (data < lower).sum()
        patches[0].set_height(patches[0].get_height() + n_lower_outliers)
        patches[0].set_facecolor('c')
        patches[0].set_label('Lower outliers: ({:.2f}, {:.2f})'.format(data.min(), lower))

    if upper_outliers:
        # Add everything above the range to the last bar and highlight it.
        n_upper_outliers = (data > upper).sum()
        patches[-1].set_height(patches[-1].get_height() + n_upper_outliers)
        patches[-1].set_facecolor('m')
        patches[-1].set_label('Upper outliers: ({:.2f}, {:.2f})'.format(upper, data.max()))

    if lower_outliers or upper_outliers:
        plt.legend()

You can also combine it with an automatic outlier detector (borrowed from here) like so:

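def mad(data):
    """Median absolute deviation of the data."""
    median = np.median(data)
    return np.median(np.abs(data - median))

def calculate_bounds(data, z_thresh=3.5):
    """Bounds outside which the modified z-score exceeds z_thresh."""
    median = np.median(data)
    # 0.6745 is roughly the 0.75 quantile of the standard normal; dividing by it
    # puts the MAD on the same scale as a standard deviation.
    const = z_thresh * mad(data) / 0.6745
    return (median - const, median + const)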

outlier_aware_hist(data, *calculate_bounds(data))

For the example plots, I generated data from a standard normal and then added some outliers. The plots show the data with and without outlier binning.
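A rough way to reproduce that kind of test data and check the bounds numerically is sketched below. The seed, the sample size, and the five outlier values are arbitrary choices for illustration, not the ones used for the original plots, and `mad_bounds` is just a self-contained restatement of the two helpers above.

```python
import numpy as np

def mad_bounds(data, z_thresh=3.5):
    """Median +/- z_thresh * MAD / 0.6745, as in calculate_bounds above."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    const = z_thresh * mad / 0.6745
    return median - const, median + const

# Assumed test data: 1000 standard normal samples plus a few hand-picked outliers.
rng = np.random.default_rng(0)
data = np.concatenate([rng.standard_normal(1000), [-10.0, -8.0, 8.0, 9.0, 12.0]])

lower, upper = mad_bounds(data)
n_low = int((data < lower).sum())   # values that would be added to the first bar
n_high = int((data > upper).sum())  # values that would be added to the last bar
print(f"bounds: ({lower:.2f}, {upper:.2f}); outliers below/above: {n_low}/{n_high}")
```

Because the median and the MAD are barely affected by a handful of extreme values, the bounds should stay close to what the standard normal part alone would give, which is why the added points end up outside them.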

answered Sep 23 '22 19:09

Benjamin Doughty

No. Looking at matplotlib.axes.Axes.hist and the direct use of numpy.histogram I'm fairly confident in saying that there is no smarter solution than using clip (other than extending the bins that you histogram with).
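To illustrate the "extending the bins" alternative, here is a minimal sketch, assuming non-empty data. The helper name `counts_with_overflow_bins` and the sample values are hypothetical. It counts with widened outer edges but returns the evenly spaced edges for drawing, so the bars keep equal width.

```python
import numpy as np

def counts_with_overflow_bins(data, lower, upper, n_bins):
    """Count with widened outer bins, but report equal-width edges for display."""
    data = np.asarray(data, dtype=float)
    display_edges = np.linspace(lower, upper, n_bins + 1)
    # Stretch only the outermost edges so that every value falls in some bin.
    count_edges = display_edges.copy()
    count_edges[0] = min(lower, data.min())
    count_edges[-1] = max(upper, data.max())
    counts, _ = np.histogram(data, bins=count_edges)
    return counts, display_edges

# Hypothetical sample: two outliers (-5 and 10) around the range 0-3.
data = np.array([-5.0, 0.5, 1.5, 1.7, 2.5, 10.0])
counts, display_edges = counts_with_overflow_bins(data, 0, 3, 3)
print(counts)  # [2 2 2]
```

The counts can then be drawn with plt.bar(display_edges[:-1], counts, width=np.diff(display_edges), align='edge'), which bypasses hist() entirely.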

I'd encourage you to look at the source of matplotlib.axes.Axes.hist (it's just Python code, though admittedly hist is slightly more complex than most of the Axes methods) - reading it is the best way to answer this kind of question.

answered Sep 23 '22 19:09

pelson
