interactive conditional histogram bucket slicing data visualization

I have a df that looks like:

df.head()
Out[1]:
        A   B   C
city0   40  12  73
city1   65  56  10
city2   77  58  71
city3   89  53  49
city4   33  98  90

An example df can be created by the following code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, size=(1000000, 3)), columns=list('ABC'))
df.index = ['city' + str(x) for x in range(1000000)]

What I want to do is:

a) determine appropriate histogram bucket lengths for column A and assign each city to a bucket for column A

b) determine appropriate histogram bucket lengths for column B and assign each city to a bucket for column B

Maybe the resulting df looks like this (or is there a better built-in way in pandas?):

df.head()
Out[1]:
        A   B   C  Abkt  Bbkt
city0  40  12  73     2     1
city1  65  56  10     4     3
city2  77  58  71     4     3
city3  89  53  49     5     3
city4  33  98  90     2     5

Where Abkt and Bbkt are histogram bucket identifiers:

1-20 = 1
21-40 = 2
41-60 = 3
61-80 = 4
81-100 = 5
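
For reference, the fixed-width mapping above can be sketched with pd.cut and explicit bin edges. This is only a minimal illustration on the five rows shown earlier; the names edges and bucket_ids are made up for the example, and include_lowest=True is an assumption so that a value of exactly 0 lands in bucket 1.

```python
import pandas as pd

# Small frame copied from the df.head() output above
df = pd.DataFrame(
    {"A": [40, 65, 77, 89, 33], "B": [12, 56, 58, 53, 98], "C": [73, 10, 71, 49, 90]},
    index=["city0", "city1", "city2", "city3", "city4"],
)

# Edges 0, 20, ..., 100 give the intervals [0, 20], (20, 40], ..., (80, 100]
edges = [0, 20, 40, 60, 80, 100]      # hypothetical name
bucket_ids = [1, 2, 3, 4, 5]          # hypothetical name

for col in ["A", "B"]:
    # include_lowest=True (an assumption) puts a value of exactly 0 into bucket 1
    binned = pd.cut(df[col], bins=edges, labels=bucket_ids, include_lowest=True)
    df[col + "bkt"] = binned.astype(int)

print(df)
```

On these five rows this reproduces the Abkt and Bbkt columns shown in the table above.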

Ultimately, I want to better understand the behavior of each city with respect to columns A, B and C and be able to answer questions like:

a) What does the distribution of Column A (or B) look like - i.e. what buckets are most/least populated.

b) Conditional on a particular slice/bucket of Column A, what does the distribution of Column B look like - i.e. what buckets are most/least populated.

c) Conditional on a particular slice/bucket of Column A and B, what does the behavior of C look like.
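
Questions a) to c) map fairly directly onto value_counts, pd.crosstab and groupby. The sketch below is one possible reading, not the only one: it assumes 5 equal-width buckets per column, uses a smaller random frame than the one in the question, and the result names (a_counts, b_given_a, c_by_cell) are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(1000, 3)), columns=list("ABC"))

# labels=False returns integer bin codes 0..4 instead of interval labels
df["Abkt"] = pd.cut(df["A"], bins=5, labels=False)
df["Bbkt"] = pd.cut(df["B"], bins=5, labels=False)

# a) how populated is each bucket of A
a_counts = df["Abkt"].value_counts().sort_index()

# b) distribution of B's buckets within each bucket of A (each row sums to 1)
b_given_a = pd.crosstab(df["Abkt"], df["Bbkt"], normalize="index")

# c) behaviour of C within each (A bucket, B bucket) cell
c_by_cell = df.groupby(["Abkt", "Bbkt"])["C"].agg(["count", "mean", "std"])

print(a_counts)
print(b_given_a.round(2))
print(c_by_cell.head())
```

A table like b_given_a is also a natural input for a heat map, since every row is one conditional distribution.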

Ideally, I want to be able to visualize the data (heat maps, region identifiers etc). I'm a relative pandas/python newbie and don't know what is possible to develop.

If the SO community can kindly provide code examples of how I can do what I want (or a better approach if there are better pandas/numpy/scipy built in methods) I would be grateful.

I would also appreciate pointers to resources that can help me summarize, slice and dice my data, and visualize it at intermediate steps as I proceed with my analysis.

UPDATE:

I am following some of the suggestions in the comments.

I tried:

1) df.hist()

ValueError: The first argument of bincount must be non-negative

2) df[['A']].hist(bins=10,range=(0,10))

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000A2350615C0>]], dtype=object)

Isn't #2 supposed to show a plot instead of producing an object that is not rendered? I am using Jupyter Notebook.

Is there something I need to turn on or enable in Jupyter Notebook to render the histogram objects?

UPDATE2:

I solved the rendering problem by following this question: "In IPython notebook, pandas is not displaying the graph I try to plot".

UPDATE3:

As per suggestions from the comments, I started looking through pandas visualization, bokeh and seaborn. However, I'm not sure how I can create linkages between plots.

Let's say I have 10 variables. Since 10 is too many to explore at once, let's say I want to explore 5 at any given time (r, s, t, u, v).

If I want an interactive hexbin plot with marginal distributions to examine the relationship between r and s, how do I also see the distributions of t, u and v given interactive region selections/slices (polygons) of r and s?

I found a hexbin plot with marginal distributions here: hexbin plot.

But:

1) How do I make this interactive (allow selection of polygons)?

2) How do I link region selections of r and s to other plots, for example 3 histogram plots of t, u and v (or any other type of plot)?

This way, I can navigate through the data more rigorously and explore the relationships in depth.
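
To make the goal concrete, here is a non-interactive sketch of the same idea: a region of r and s is expressed as a boolean mask, and the histograms of t, u and v are computed for the full data and for the selection using the same bin edges. The rectangular bounds and the random data are placeholders; an interactive tool would supply the mask from a drawn selection instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(5000, 5)), columns=list("rstuv"))

# Hypothetical rectangular selection in the (r, s) plane
selected = df["r"].between(0.0, 1.0) & df["s"].between(-0.5, 0.5)

conditional = {}
for col in ["t", "u", "v"]:
    # Same edges for the full and the selected data so the two are comparable
    edges = np.histogram_bin_edges(df[col], bins=20)
    full, _ = np.histogram(df[col], bins=edges)
    subset, _ = np.histogram(df.loc[selected, col], bins=edges)
    # Index each row by the left edge of its bin
    conditional[col] = pd.DataFrame({"all": full, "selected": subset}, index=edges[:-1])

print(conditional["t"])
```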

codingknob asked Aug 26 '16 00:08

1 Answer

To get the interaction effect you're looking for, you must bin all the columns you care about together, so that they share the same bucket edges.

The cleanest way I can think of to do this is to stack the columns into a single series and then use pd.cut.

Considering your sample df:


df_ = pd.cut(df[['A', 'B']].stack(), 5, labels=list(range(5))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'
pd.concat([df, df_], axis=1)
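
A small sketch of why stacking matters, using a made-up three-row frame: cutting each column separately gives every column its own edges, while cutting the stacked series gives one set of edges shared by all columns. labels=False is used here only to get plain integer codes.

```python
import pandas as pd

# Toy data (hypothetical): A spans 0..20, B spans 0..100
df = pd.DataFrame({"A": [0, 10, 20], "B": [0, 50, 100]})

# Per-column cut: each column is split into 5 bins over its own range
per_col = df.apply(lambda s: pd.cut(s, 5, labels=False))

# Stacked cut: one set of 5 bins over the combined range 0..100
shared = pd.cut(df.stack(), 5, labels=False).unstack()

print(per_col)
print(shared)
```

With per-column bins, A and B both get the codes 0, 2, 4 even though their scales differ. With shared bins, every value of A falls into bucket 0, so bucket numbers are directly comparable across columns.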


Let's build a better example and look at a visualization using seaborn:

import numpy as np
import pandas as pd
import seaborn as sns

# two roughly normal columns with different centres
df = pd.DataFrame(dict(A=(np.random.randn(10000) * 100 + 20).astype(int),
                       B=(np.random.randn(10000) * 100 - 20).astype(int)))

df.index = df.index.to_series().astype(str).radd('city')

# stack, cut into 30 shared buckets, then unstack back into columns
df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'

sns.jointplot(x=df_.Abkt, y=df_.Bbkt, kind="scatter", color="k")


Or how about some data with some correlation:

mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 100000)
df = pd.DataFrame(data, columns=["A", "B"])

df.index = df.index.to_series().astype(str).radd('city')

df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'

sns.jointplot(x=df_.Abkt, y=df_.Bbkt, kind="scatter", color="k")


Interactive bokeh

Without getting too complicated:

import numpy as np
import pandas as pd

from bokeh.io import show, output_notebook, output_file
from bokeh.plotting import figure
from bokeh.layouts import row, column
from bokeh.models import ColumnDataSource, Select, CustomJS

output_notebook()

# generate random data
flips = np.random.choice((1, -1), (5, 5))
flips = np.tril(flips, -1) + np.triu(flips, 1) + np.eye(flips.shape[0])

half = np.ones((5, 5)) / 2
cov = (half + np.diag(np.diag(half))) * flips
mean = np.zeros(5)

data = np.random.multivariate_normal(mean, cov, 10000)
df = pd.DataFrame(data, columns=list('ABCDE'))

df.index = df.index.to_series().astype(str).radd('city')

# Stack and cut to get dependent relationships
b = 20
df_ = pd.cut(df.stack(), b, labels=list(range(b))).unstack()

# assign default columns x and y.  These will be the columns I set bokeh to read
df_[['x', 'y']] = df_.loc[:, ['A', 'B']]

source = ColumnDataSource(data=df_)

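# NOTE: written against the newer Bokeh callback API (source.data, cb_obj.value,
# source.change.emit(), js_on_change); very old releases used .get()/.trigger()
# and a callback= argument instead. The 'resize' tool no longer exists.
tools = 'box_select,pan,box_zoom,wheel_zoom,reset,save'

# pass the tool list to the figure so box_select is actually available
p = figure(width=600, height=300, tools=tools)
p.circle('x', 'y', source=source, fill_color='olive', line_color='black', alpha=.5)

# generic callback factory: copy the selected column into data['<like><n>']
# (not used below, kept for extending the example to more selectors)
def gcb(like, n):
    code = """
    var data = source.data;
    var f = cb_obj.value;
    data['{0}{1}'] = data[f];
    source.change.emit();
    """
    return CustomJS(args=dict(source=source), code=code.format(like, n))

# copy the chosen column into 'x' and redraw
xcb = CustomJS(
    args=dict(source=source),
    code="""
    var data = source.data;
    var colm = cb_obj.value;
    data['x'] = data[colm];
    source.change.emit();
    """
)

# copy the chosen column into 'y' and redraw
ycb = CustomJS(
    args=dict(source=source),
    code="""
    var data = source.data;
    var colm = cb_obj.value;
    data['y'] = data[colm];
    source.change.emit();
    """
)

options = list('ABCDE')
x_select = Select(options=options, value='A')
y_select = Select(options=options, value='B')
x_select.js_on_change('value', xcb)
y_select.js_on_change('value', ycb)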


show(column(p, row(x_select, y_select)))

piRSquared answered Sep 18 '22 13:09