I have run into the following problem of sorting the row and column headers.
Here is how to reproduce this:
X =pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100)))
A=pd.qcut(X['x'], [0,0.25,0.5,0.75,1.0]) #create a factor
B=pd.qcut(X['y'], [0,0.25,0.5,0.75,1.0]) # create another factor
g = X.groupby([A,B])['x'].mean() #do a two-way bucketing
print g
#this gives the following and so far so good
x y
[-2.315, -0.843] [-2.58, -0.567] -1.041167
(-0.567, 0.0321] -1.722926
(0.0321, 0.724] -1.245856
(0.724, 3.478] -1.240876
(-0.843, -0.228] [-2.58, -0.567] -0.576264
(-0.567, 0.0321] -0.501709
(0.0321, 0.724] -0.522697
(0.724, 3.478] -0.506259
(-0.228, 0.382] [-2.58, -0.567] 0.175768
(-0.567, 0.0321] 0.214353
(0.0321, 0.724] 0.113650
(0.724, 3.478] -0.013758
(0.382, 2.662] [-2.58, -0.567] 0.983807
(-0.567, 0.0321] 1.214640
(0.0321, 0.724] 0.808608
(0.724, 3.478] 1.515334
Name: x, dtype: float64
#Now let's make a two way table and here is the problem:
HTML(g.unstack().to_html())
This shows:
y (-0.567, 0.0321] (0.0321, 0.724] (0.724, 3.478] [-2.58, -0.567]
x
(-0.228, 0.382] 0.214353 0.113650 -0.013758 0.175768
(-0.843, -0.228] -0.501709 -0.522697 -0.506259 -0.576264
(0.382, 2.662] 1.214640 0.808608 1.515334 0.983807
[-2.315, -0.843] -1.722926 -1.245856 -1.240876 -1.041167
Note how the headers are no longer sorted. I am wondering what is a good way to solve this problem so as to making interactive work easy.
To further track down where the problem is, run the following:
g.unstack().columns
It gives me this: Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object)
Now compare this with B.levels:
B.levels
Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object)
Obviously the order originally in Factor is lost.
Now to make the matter even worse, let's do a multi-level cross table:
g2 = X.groupby([A,B]).agg('mean')
g3 = g2.stack().unstack(-2)
HTML(g3.to_html())
It shows something like:
y (-0.567, 0.0321] (0.0321, 0.724] (0.724, 3.478]
x
(-0.228, 0.382] x 0.214353 0.113650 -0.013758
y -0.293465 0.321836 1.180369
(-0.843, -0.228] x -0.501709 -0.522697 -0.506259
y -0.204811 0.324571 1.167005
(0.382, 2.662] x 1.214640 0.808608 1.515334
y -0.195446 0.161198 1.074532
[-2.315, -0.843] x -1.722926 -1.245856 -1.240876
y -0.392896 0.335471 1.730513
With both the row and column labels sorted incorrectly.
Thanks.
By calling unstack() , it unstack the inner-most index onto column.
Pandas Unstack is a function that pivots the level of the indexed columns in a stacked dataframe.
Reorder Columns using Pandas . Another way to reorder columns is to use the Pandas . reindex() method. This allows you to pass in the columns= parameter to pass in the order of columns that you want to use.
No, it does not work for missing values. Then you start doing dropna or fillna on various columns that are not matching.
This seems like a little bit of a hack, but here goes:
In [11]: g_unstacked = g.unstack()
In [12]: g_unstacked
Out[12]:
y (-0.565, 0.12] (0.12, 0.791] (0.791, 2.57] [-2.177, -0.565]
x
(-0.068, 0.625] 0.389408 0.267252 0.283344 0.258337
(-0.892, -0.068] -0.121413 -0.471889 -0.448977 -0.462180
(0.625, 1.639] 0.987372 1.006496 0.830710 1.202158
[-3.124, -0.892] -1.513954 -1.482813 -1.394198 -1.756679
Making use the fact that unique
preserves order* (grabbing the unique first terms in from the index of g):
In [13]: g.index.get_level_values(0).unique()
Out[13]:
array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]',
'(0.625, 1.639]'], dtype=object)
As you can see, these are in the correct order.
Now you can reindex
by this:
In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique())
Out[14]:
y (-0.565, 0.12] (0.12, 0.791] (0.791, 2.57] [-2.177, -0.565]
[-3.124, -0.892] -1.513954 -1.482813 -1.394198 -1.756679
(-0.892, -0.068] -0.121413 -0.471889 -0.448977 -0.462180
(-0.068, 0.625] 0.389408 0.267252 0.283344 0.258337
(0.625, 1.639] 0.987372 1.006496 0.830710 1.202158
Which is now in the correct order.
Update (I missed that the columns were also not in order).
You can use the same trick for the columns (you'll have to chain these operations):
In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1)
* this is the reason Series unique is significantly faster than np.unique
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With