
Pandas plot hist sharex=False does not behave as expected

I am trying to plot histograms of several series from a dataframe. The series have different maximum values:

df[[
    'age_sent', 'last_seen', 'forum_reply', 'forum_cnt', 'forum_exp', 'forum_quest'
]].max()

returns:

age_sent       1516.564016
last_seen       986.790035
forum_reply     137.000000
forum_cnt       155.000000
forum_exp        13.000000
forum_quest      10.000000

When I plot the histograms with sharex=False, subplots=True, it looks like the sharex property is ignored:

df[[
    'age_sent', 'last_seen', 'forum_reply', 'forum_cnt', 'forum_exp', 'forum_quest'
]].plot.hist(figsize=(20, 10), logy=True, sharex=False, subplots=True)



I can of course plot each of them separately, but this is less desirable. I would also like to know what I am doing wrong.


The data I have is too big to be included, but it is easy to create something similar:

ttt = pd.DataFrame({
    'a': pd.Series(np.random.uniform(1, 1000, 100)),
    'b': pd.Series(np.random.uniform(1, 10, 100)),
})

Now I have:

ttt.plot.hist(logy=True, sharex=False, subplots=True)

Check the x axis: both subplots end up with the same range. I want each subplot to get its own x range, as it does below, but using one command with subplots.

ttt['a'].plot.hist(logy=True)
ttt['b'].plot.hist(logy=True)
Salvador Dali asked Sep 01 '16 04:09





2 Answers

The sharex argument (most likely) just falls through to mpl, where it controls whether panning / zooming one axes also changes the other.
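As a quick check of what sharex does on the matplotlib side, here is a minimal sketch (the figure and axes names are made up for illustration). With sharex=True, setting the x-limits on one axes moves the other; with sharex=False they stay independent. Neither setting has any say in how histogram bins are computed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt

# Linked x axes: changing the limits on one axes propagates to the other.
fig_shared, (top, bottom) = plt.subplots(2, 1, sharex=True)
top.set_xlim(0, 5)
shared_limits = bottom.get_xlim()

# Independent x axes: the second axes keeps its own (default) limits.
fig_free, (top2, bottom2) = plt.subplots(2, 1, sharex=False)
top2.set_xlim(0, 5)
free_limits = bottom2.get_xlim()

print(shared_limits, free_limits)
```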

The issue you are having is that the same bins are being used for all of the histograms (this is enforced by https://github.com/pydata/pandas/blob/master/pandas/tools/plotting.py#L2053, if I am understanding the code correctly). pandas assumes that if you plot multiple histograms you are probably plotting columns of similar data, so using the same binning makes them comparable.
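To see why pooled bins hurt here, a rough numpy-only sketch may help. It does not call pandas' plotting code; it only imitates the idea of computing one set of bin edges from all values together, using data shaped like the question's example (the seed and variable names are my own assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(1, 1000, 100)  # wide-range column, like 'a' in the question
b = rng.uniform(1, 10, 100)    # narrow-range column, like 'b'

# Pooled binning: one set of edges computed from all values together.
shared_edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=10)

# Per-column binning: edges fitted to the narrow column's own range.
edges_b = np.histogram_bin_edges(b, bins=10)

# Count the narrow column both ways.
counts_b_shared, _ = np.histogram(b, bins=shared_edges)
counts_b_own, _ = np.histogram(b, bins=edges_b)

print(counts_b_shared)  # all of b lands in the first pooled bin
print(counts_b_own)     # b spread over bins that cover only its own range
```

With pooled edges each bin is roughly 100 units wide, so the whole of 'b' collapses into a single bar, which is what the subplot in the question shows.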

Assuming you have mpl >= 1.5 and numpy >= 1.11, you should write yourself a little helper function like

import matplotlib.pyplot as plt
import matplotlib as mpl 
import pandas as pd
import numpy as np

plt.ion()


def make_hists(df, fig_kwargs=None, hist_kwargs=None,
               style_cycle=None):
    '''Plot one histogram per column, each with its own bins.

    Parameters
    ----------
    df : pd.DataFrame
        Datasource

    fig_kwargs : dict, optional
        kwargs to pass to `plt.subplots`

        defaults to {'figsize': (4, 1.5*len(df.columns)),
                     'tight_layout': True}

    hist_kwargs : dict, optional
        Extra kwargs to pass to `ax.hist`, defaults
        to `{'log': True, 'bins': 'auto'}`

    style_cycle : cycler
        Style cycle to use, defaults to 
        mpl.rcParams['axes.prop_cycle']

    Returns
    -------
    fig : mpl.figure.Figure
        The figure created

    ax_list : list
        The mpl.axes.Axes objects created 

    arts : dict 
        maps column names to the histogram artist
    '''
    if style_cycle is None:
        style_cycle = mpl.rcParams['axes.prop_cycle']

    if fig_kwargs is None:
        fig_kwargs = {}
    if hist_kwargs is None:
        hist_kwargs = {}

    hist_kwargs.setdefault('log', True)
    # this requires numpy >= 1.11
    hist_kwargs.setdefault('bins', 'auto')
    cols = df.columns

    fig_kwargs.setdefault('figsize', (4, 1.5*len(cols)))
    fig_kwargs.setdefault('tight_layout', True)
    fig, ax_lst = plt.subplots(len(cols), 1, **fig_kwargs)
    arts = {}
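    # one subplot per column; style_cycle() yields an endless stream of style dicts
    for ax, col, sty in zip(ax_lst, cols, style_cycle()):
        # passing the column name with data=df also uses it as the legend label;
        # unpacking two dicts in one call needs Python >= 3.5
        h = ax.hist(col, data=df, **hist_kwargs, **sty)
        ax.legend()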

        arts[col] = h

    return fig, list(ax_lst), arts

dist = [1, 2, 5, 7, 50]
col_names = ['weibull $a={}$'.format(alpha) for alpha in dist]
test_df = pd.DataFrame(np.random.weibull(dist,
                                         (10000, len(dist))),
                       columns=col_names)

make_hists(test_df)
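For the two-column frame from the question, a shorter variant of the same idea is to loop over the columns yourself. This is only a sketch: the seed and variable names are my own, and it relies on Series.plot.hist binning just the one column it is given.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ttt = pd.DataFrame({'a': rng.uniform(1, 1000, 100),
                    'b': rng.uniform(1, 10, 100)})

# squeeze=False keeps `axes` a 2-D array even for a single column
fig, axes = plt.subplots(len(ttt.columns), 1, squeeze=False)
for ax, col in zip(axes.ravel(), ttt.columns):
    # each call bins only this column, so each subplot gets its own x range
    ttt[col].plot.hist(ax=ax, logy=True, legend=True)

x_max_a = axes[0, 0].get_xlim()[1]
x_max_b = axes[1, 0].get_xlim()[1]
print(x_max_a, x_max_b)
```

It is not a single command, but it keeps everything in one figure and gives each column its own bins and x range.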


tacaswell answered Oct 09 '22 11:10



The current answer works, but there is an easier workaround in recent versions.

While df.plot.hist does not respect sharex=False, df.plot.density does.

dist = [1, 2, 7, 50]
col_names = ['weibull $a={}$'.format(alpha) for alpha in dist]
test_df = pd.DataFrame(np.random.weibull(dist,
                                         (10000, len(dist))),
                       columns=col_names)

test_df.plot.density(subplots=True, sharex=False, sharey=False, layout=(-1, 2))

density plots respect sharex

hume answered Oct 09 '22 12:10
