I have a df that looks like:
df.head()
Out[1]:
A B C
city0 40 12 73
city1 65 56 10
city2 77 58 71
city3 89 53 49
city4 33 98 90
An example df can be created by the following code:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, size=(1000000, 3)), columns=list('ABC'))
df.index = ['city' + str(x) for x in range(1000000)]
What I want to do is:
a) determine appropriate histogram bucket lengths for column A and assign each city to a bucket for column A
b) determine appropriate histogram bucket lengths for column B and assign each city to a bucket for column B
Maybe the resulting df looks like this (or is there a better built-in way in pandas?):
df.head()
Out[1]:
A B C Abkt Bbkt
city0 40 12 73 2 1
city1 65 56 10 4 3
city2 77 58 71 4 3
city3 89 53 49 5 3
city4 33 98 90 2 5
Where Abkt and Bbkt are histogram bucket identifiers:
1-20 = 1
21-40 = 2
41-60 = 3
61-80 = 4
81-100 = 5
Ultimately, I want to better understand the behavior of each city with respect to columns A, B and C and be able to answer questions like:
a) What does the distribution of Column A (or B) look like - i.e. what buckets are most/least populated.
b) Conditional on a particular slice/bucket of Column A, what does the distribution of Column B look like - i.e. what buckets are most/least populated.
c) Conditional on a particular slice/bucket of Column A and B, what does the behavior of C look like.
Ideally, I want to be able to visualize the data (heat maps, region identifiers, etc.). I'm a relative pandas/Python newbie and don't know what is possible.
If the SO community can kindly provide code examples of how I can do what I want (or a better approach, if there are better pandas/numpy/scipy built-in methods), I would be grateful.
I would also appreciate any pointers to resources that can help me better summarize, slice, and dice my data and visualize it at intermediate steps as I proceed with my analysis.
UPDATE:
I am following some of the suggestions in the comments.
I tried:
1) df.hist()
ValueError: The first argument of bincount must be non-negative
2) df[['A']].hist(bins=10,range=(0,10))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000000A2350615C0>]], dtype=object)
Isn't #2 supposed to show a plot instead of producing an object that is not rendered? I am using Jupyter Notebook.
Is there something I need to turn on or enable in Jupyter Notebook to render the histogram objects?
UPDATE2:
I solved the rendering problem by following this question: "in Ipython notebook, Pandas is not displying the graph I try to plot."
UPDATE3:
As per suggestions from the comments, I started looking through pandas visualization, bokeh and seaborn. However, I'm not sure how I can create linkages between plots.
Let's say I have 10 variables. I want to explore them, but since 10 is a large number to explore at once, let's say I want to explore 5 at any given time (r, s, t, u, v).
If I want an interactive hexbin plot with marginal distributions to examine the relationship between r and s, how do I also see the distributions of t, u and v given interactive region selections/slices (polygons) of r and s?
I found a hexbin plot with marginal distributions here: hexbin plot.
But:
1) How do I make this interactive (allow selections of polygons)?
2) How do I link region selections of r and s to other plots, for example 3 histogram plots of t, u and v (or any other type of plot)?
This way, I can navigate through the data more rigorously and explore the relationships in depth.
To get the interaction effect you're looking for, you must bin all the columns you care about together, so that every column shares the same bin edges and a bucket id means the same value range in each column.
The cleanest way I can think of doing this is to stack the columns into a single series and then use pd.cut.
Considering your sample df:
df_ = pd.cut(df[['A', 'B']].stack(), 5, labels=list(range(5))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'
pd.concat([df, df_], axis=1)
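To see why the stacking step matters, here is a minimal sketch on the five cities from the question. It compares shared bins (stack, then cut) with per-column bins, and also shows fixed edges that reproduce the 1-20, 21-40, ... mapping from the question. The names `shared`, `per_column`, `fixed` and `edges` are only illustrative, and `labels=False` is my substitution for the explicit label list so that `pd.cut` returns plain integer bucket ids.

```python
import pandas as pd

# The five sample rows shown in the question.
df = pd.DataFrame(
    {"A": [40, 65, 77, 89, 33],
     "B": [12, 56, 58, 53, 98],
     "C": [73, 10, 71, 49, 90]},
    index=["city" + str(i) for i in range(5)],
)
cols = ["A", "B"]

# Shared bins: stack A and B into one series so a single set of
# 5 equal-width edges (spanning 12..98 here) is used for both columns.
shared = pd.cut(df[cols].stack(), 5, labels=False).unstack()

# Per-column bins: each column gets its own edges (A spans 33..89),
# so the same bucket id covers different value ranges in A and B.
per_column = df[cols].apply(lambda s: pd.cut(s, 5, labels=False))

# Fixed edges matching the question's table: (0, 20] -> 1, (20, 40] -> 2, ...
edges = [0, 20, 40, 60, 80, 100]
fixed = df[cols].apply(
    lambda s: pd.cut(s, bins=edges, labels=False, include_lowest=True) + 1
)

print(shared["A"].tolist())      # [1, 3, 3, 4, 1]
print(per_column["A"].tolist())  # [0, 2, 3, 4, 0]
print(fixed["A"].tolist())       # [2, 4, 4, 5, 2]
```

With shared bins, A=65 and B=56 land in different buckets (3 and 2). With per-column bins both get bucket 2, which hides the difference between them.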
Let's build a better example and look at a visualization using seaborn
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame(dict(A=(np.random.randn(10000) * 100 + 20).astype(int),
                       B=(np.random.randn(10000) * 100 - 20).astype(int)))
df.index = df.index.to_series().astype(str).radd('city')
df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'
sns.jointplot(x=df_.Abkt, y=df_.Bbkt, kind="scatter", color="k")
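The bin count of 30 is a manual choice. If you would rather let the data suggest the bucket width, one option is NumPy's `histogram_bin_edges` with `bins="auto"`, passing the resulting edges to `pd.cut`. This is a sketch of an alternative, not something the example above depends on. The seeded generator and the names `stacked`, `edges` and `buckets` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the random data above (the seed is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.normal(20, 100, 10_000),
    "B": rng.normal(-20, 100, 10_000),
})

# Stack first so both columns still share one set of edges.
stacked = df.stack()

# Let NumPy pick the number and width of bins from the pooled data.
edges = np.histogram_bin_edges(stacked.to_numpy(), bins="auto")

# Cut with the explicit edges; include_lowest keeps the minimum value
# inside the first bucket instead of turning it into NaN.
buckets = pd.cut(stacked, bins=edges, labels=False, include_lowest=True).unstack()

print(len(edges) - 1, "buckets of width", edges[1] - edges[0])
```

The `"auto"` rule is only one of several estimators that `histogram_bin_edges` accepts, so try more than one before settling on a width.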
Or how about some data with some correlation
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 100000)
df = pd.DataFrame(data, columns=["A", "B"])
df.index = df.index.to_series().astype(str).radd('city')
df_ = pd.cut(df[['A', 'B']].stack(), 30, labels=list(range(30))).unstack()
df_.columns = df_.columns.to_series() + 'bkt'
sns.jointplot(x=df_.Abkt, y=df_.Bbkt, kind="scatter", color="k")
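The counts behind a plot like this can also be tabulated directly, which speaks to the questions about conditional distributions. Below is a minimal sketch on a tiny hand-made bucket table. The values and the names `buckets`, `a_counts`, `b_given_a` and `c_by_ab` are made up for illustration. In practice you would use the bucketed frame joined with column C.

```python
import pandas as pd

# Hypothetical bucket ids for A and B plus the raw C values.
buckets = pd.DataFrame({
    "Abkt": [2, 4, 4, 5, 2, 4],
    "Bbkt": [1, 3, 3, 3, 5, 1],
    "C":    [73, 10, 71, 49, 90, 30],
})

# (a) How populated is each bucket of A?
a_counts = buckets["Abkt"].value_counts().sort_index()

# (b) Distribution of B's buckets conditional on A's bucket:
#     each row is one A bucket and sums to 1.
b_given_a = pd.crosstab(buckets["Abkt"], buckets["Bbkt"], normalize="index")

# (c) Behaviour of C conditional on both the A and B bucket.
c_by_ab = buckets.groupby(["Abkt", "Bbkt"])["C"].agg(["count", "mean"])

print(a_counts)
print(b_given_a)
print(c_by_ab)
```

Dropping `normalize="index"` gives raw counts per (Abkt, Bbkt) cell, which is the table a heat map would draw.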
bokeh
Without getting too complicated
from bokeh.io import show, output_notebook, output_file
from bokeh.plotting import figure
from bokeh.layouts import row, column
from bokeh.models import ColumnDataSource, Select, CustomJS
output_notebook()
# generate random data
flips = np.random.choice((1, -1), (5, 5))
flips = np.tril(flips, -1) + np.triu(flips, 1) + np.eye(flips.shape[0])
half = np.ones((5, 5)) / 2
cov = (half + np.diag(np.diag(half))) * flips
mean = np.zeros(5)
data = np.random.multivariate_normal(mean, cov, 10000)
df = pd.DataFrame(data, columns=list('ABCDE'))
df.index = df.index.to_series().astype(str).radd('city')
# Stack and cut to get dependent relationships
b = 20
df_ = pd.cut(df.stack(), b, labels=list(range(b))).unstack()
# assign default columns x and y. These will be the columns I set bokeh to read
df_[['x', 'y']] = df_.loc[:, ['A', 'B']]
source = ColumnDataSource(data=df_)
tools = 'box_select,pan,box_zoom,wheel_zoom,reset,resize,save'
# pass the tool list to the figure; otherwise `tools` is never used
p = figure(plot_width=600, plot_height=300, tools=tools)
p.circle('x', 'y', source=source, fill_color='olive', line_color='black', alpha=.5)
def gcb(like, n):
    code = """
    var data = source.get('data');
    var f = cb_obj.get('value');
    data['{0}{1}'] = data[f];
    source.trigger('change');
    """
    return CustomJS(args=dict(source=source), code=code.format(like, n))

xcb = CustomJS(
    args=dict(source=source),
    code="""
    var data = source.get('data');
    var colm = cb_obj.get('value');
    data['x'] = data[colm];
    source.trigger('change');
    """
)
ycb = CustomJS(
    args=dict(source=source),
    code="""
    var data = source.get('data');
    var colm = cb_obj.get('value');
    data['y'] = data[colm];
    source.trigger('change');
    """
)
options = list('ABCDE')
x_select = Select(options=options, callback=xcb, value='A')
y_select = Select(options=options, callback=ycb, value='B')
show(column(p, row(x_select, y_select)))