
How to use a specific list of bins for multiple histograms from DataFrame, when using plotly+cufflinks?

It is relatively easy to manually specify a list of bins when plotting a histogram with matplotlib.

A simple example of this is the following:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(np.random.randn(10000), bins=np.arange(-4, 4, 0.1))
ax.hist(0.2 * np.random.randn(10000), bins=np.arange(-4, 4, 0.1))
plt.show()

The same can be done from a pandas.DataFrame with:

import pandas as pd

pd.DataFrame({
    'firstHistogram': np.random.randn(10000),
    'secondHistogram': 0.2 * np.random.randn(10000)
}).plot(kind='hist', bins=np.arange(-4, 4, 0.1))

Going further, plotly can interface directly with pandas through the cufflinks module, which makes things like the following possible:

pd.DataFrame({
    'firstHistogram': np.random.randn(10000),
    'secondHistogram': 0.2 * np.random.randn(10000)
}).iplot(kind='hist', bins=100)

[Figure: the two histograms produced by iplot with bins=100]

But here is the catch: the iplot method provided by cufflinks does not seem to accept a list for bins. When a number is provided, as in the above example, that number is used to bin each dataset independently. The two histograms therefore end up with different bin widths, which can give misleading results (see the equal heights in the above plot).

While this effect can be somewhat mitigated using the histnorm='density' option, one may want to see the counts per bin and not a density.
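
The effect can be reproduced without any plotting library. The following is a minimal sketch using numpy only; the variable names and the fixed seed are illustrative assumptions, not part of the original example. It shows that a bin count yields different bin widths for the two datasets, while an explicit list of edges keeps the counts directly comparable.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
wide = rng.standard_normal(10_000)
narrow = 0.2 * rng.standard_normal(10_000)

# A bin *count* is applied to each dataset's own range, so the edges differ
_, wide_edges = np.histogram(wide, bins=100)
_, narrow_edges = np.histogram(narrow, bins=100)
wide_width = wide_edges[1] - wide_edges[0]
narrow_width = narrow_edges[1] - narrow_edges[0]
print("bin width, wide dataset:  ", wide_width)
print("bin width, narrow dataset:", narrow_width)

# Explicit shared edges give both datasets the same bins
shared_edges = np.arange(-4, 4, 0.1)
wide_counts, _ = np.histogram(wide, bins=shared_edges)
narrow_counts, _ = np.histogram(narrow, bins=shared_edges)
print("peak count, wide dataset:  ", wide_counts.max())
print("peak count, narrow dataset:", narrow_counts.max())
```

With shared edges the narrower distribution shows a much taller peak, which is the picture the matplotlib version gives.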

Is there a way around this?

asked Mar 09 '23 07:03 by glS


2 Answers

I have added an update for this. You should now be able to specify bins=(start,end,size):

pd.DataFrame({
    'firstHistogram': np.random.randn(10000),
    'secondHistogram': 0.2 * np.random.randn(10000)
}).iplot(kind='hist', bins=(-4, 4, .08))

This should now return: [Figure: Custom bins]

answered Apr 29 '23 21:04 by jorge.santos


As far as I know there is no direct way of doing it in cufflinks. The output shown in your code is wrong in my opinion, i.e. I think that's a bug in cufflinks.

But you can easily imitate the cufflinks function with a few lines of code. You can get the same layout with cufflinks.getLayout() and just need to set barmode to overlay.


import numpy as np
import pandas as pd
import plotly.graph_objs
import plotly.offline
import cufflinks

plotly.offline.init_notebook_mode()

# Keep a reference to the DataFrame so the loop below can use it
df = pd.DataFrame({
    'firstHistogram': np.random.randn(10000),
    'secondHistogram': 0.2 * np.random.randn(10000)
})

data = list()

# One Histogram trace per column, all sharing the same explicit bins
for dd in df:
    histo = plotly.graph_objs.Histogram(x=df[dd],
                                        name=dd,
                                        xbins={'start': -4, 'end': 4, 'size': 0.08},
                                        autobinx=False,
                                        opacity=0.8
                                       )
    data.append(histo)

# Reuse the cufflinks layout and overlay the bars
layout = plotly.graph_objs.Layout(cufflinks.getLayout(),
                                  barmode='overlay')
fig = plotly.graph_objs.Figure(data=data,
                               layout=layout)
plotly.offline.iplot(fig)
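
Another possible route, sketched here under assumptions rather than taken from the answer above, is to compute the counts yourself with shared bin edges and plot the resulting table as bars. The names `edges`, `centers`, and `counts` are illustrative, and the final plotting call is left as a comment because whether cufflinks accepts `barmode='overlay'` for bar charts in your installed version is an assumption to check.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'firstHistogram': rng.standard_normal(10_000),
    'secondHistogram': 0.2 * rng.standard_normal(10_000),
})

# 101 edges from -4 to 4 give 100 bins of width 0.08
edges = np.linspace(-4, 4, 101)
centers = (edges[:-1] + edges[1:]) / 2

# One column of counts per original column, indexed by bin center
counts = pd.DataFrame(
    {col: np.histogram(df[col], bins=edges)[0] for col in df},
    index=pd.Index(centers, name='bin_center'),
)
print(counts.describe())

# The table can then be drawn as bars, for example:
#   counts.plot(kind='bar')                        # pandas/matplotlib
#   counts.iplot(kind='bar', barmode='overlay')    # cufflinks (assumed)
```

Since the binning happens before plotting, both columns are guaranteed to share the same bins regardless of how the plotting backend handles its own `bins` argument.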
answered Apr 29 '23 21:04 by Maximilian Peters