I want to use a pivot table to summarise a dataset and then be able to access the information in the pivot table as if it were a DataFrame.
Consider a hierarchical dataset with patients being treated in a hospital and hospitals situated in regions:
import pandas as pd
example_data = {'patient' : ['p1','p2','p3','p4','p5','p6','p7','p8','p9','p10','p11','p12','p13','p14','p15','p16','p17','p18','p19','p20','p21','p22','p23','p24','p25','p26','p27','p28','p29','p30','p31','p32','p33','p34','p35','p36','p37','p38','p39','p40','p41','p42','p43','p44','p45','p46','p47','p48','p49','p50','p51','p52','p53','p54','p55','p56','p57','p58','p59','p60','p61','p62','p63'],
'hospital' : ['h1','h1','h1','h2','h2','h2','h2','h3','h3','h3','h3','h3','h4','h4','h4','h4','h4','h4','h5','h5','h5','h5','h5','h5','h5','h6','h6','h6','h6','h6','h6','h6','h6','h7','h7','h7','h7','h7','h7','h7','h7','h7','h8','h8','h8','h8','h8','h8','h8','h8','h8','h8','h9','h9','h9','h9','h9','h9','h9','h9','h9','h9','h9'],
'region' : ['r1','r1','r1','r1','r1','r1','r1','r1','r1','r1','r1','r1','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r2','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3','r3'] }
example_dataframe = pd.DataFrame(example_data)
print example_dataframe
This produces a simple output as follows:
hospital patient region
0 h1 p1 r1
1 h1 p2 r1
2 h1 p3 r1
3 h2 p4 r1
4 h2 p5 r1
5 h2 p6 r1
6 h2 p7 r1
7 h3 p8 r1
8 h3 p9 r1
9 h3 p10 r1
10 h3 p11 r1
11 h3 p12 r1
12 h4 p13 r2
13 h4 p14 r2
14 h4 p15 r2
15 h4 p16 r2
16 h4 p17 r2
etc.
Now I want to summarise using a pivot table, simply counting the number of patients in each hospital:
example_pivot_table = pd.pivot_table(example_dataframe, values='patient', rows=['hospital','region'], aggfunc='count')
print example_pivot_table
This produces the following output:
hospital region
h1 r1 3
h2 r1 4
h3 r1 5
h4 r2 6
h5 r2 7
h6 r2 8
h7 r3 9
h8 r3 10
h9 r3 11
Name: patient, dtype: int64
As far as I understand it, this is actually a multi-indexed series.
How can I use this data to find out what region hospital h7 is in? If hospital
, region
and the patient count data were separate columns in a DataFrame, it would be easy. But I think hospital and region are indices. I've tried lots of things but haven't been able to get it to work.
You can use get_level_values to get the hospital column. You can pass either the number of level or the name of level, i.e. 0
or hospital
Then you can get what you want by:
In [38]: example_pivot_table[ example_pivot_table.index.get_level_values('hospital') == 'h7' ]
Out[38]:
hospital region
h7 r3 9
Name: patient, dtype: int64
To get the regions, you can do
example_pivot_table[ example_pivot_table.index.get_level_values('hospital') == 'h7' ]['regions']
First up, this isn't a pivot table job, it's a groupby
job.
Pivot tables are for re-shaping your data when you haven't set an index (see this doc article), stack
and unstack
are for reshaping when you have set an index, and groupby
is for aggregating (which is what this is) and split-apply-combine operations.
Here's how to get your patient counts using groupby
:
>>> patient_count = df.groupby(['hospital', 'region']).count()
>>> print patient_count
patient
hospital region
h1 r1 3
h2 r1 4
h3 r1 5
h4 r2 6
h5 r2 7
h6 r2 8
h7 r3 9
h8 r3 10
h9 r3 11
To select certain rows in a multi-index, I usually use ix
as follows:
>>> h7 = patient_count.ix['h7']
>>> print h7
patient
region
r3 9
And now you can use get_level_values
>>> h7.index.values[0]
'r3'
Or, if you don't want the multi-indexed version (and, for your purposes you probably don't) you can do:
>>> patient_count = patient_count.reset_index()
Which allows you to find which region hospital h7
is in as follows:
>>> patient_count.region[patient_count.hospital == 'h7']
6 r3
Name: region, dtype: object
If you just want r3
you can do:
>>> patient_count.region[patient_count.hospital == 'h7'].values[0]
'r3'
Note that reset_index
doesn't happen in-place which makes it great for chaining methods like this:
>>> patient_count.ix['h7'].reset_index().region[0]
'r3'
This will do the trick :
levels = example_pivot_table.columns.levels
labels = example_pivot_table.columns.labels
example_pivot_table.columns = levels[1][labels[1]]
example_pivot_table.reset_index(inplace=True)
example_pivot_table
So,find the levels and labels in your pivot table assign column names and reset index in it. The final result should be the result dataframe of the pivot.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With