I have run into the following problem of sorting the row and column headers. Here is how to reproduce this: <pre class="prettyprint"><code>X =pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100))) A=pd.qcut(X['x'], [0,0.25,0.5,0.75,1.0]) #create a factor B=pd.qcut(X['y'], [0,0.25,0.5,0.75,1.0]) # create another factor g = X.groupby([A,B])['x'].mean() #do a two-way bucketing print g #this gives the following and so far so good x y [-2.315, -0.843] [-2.58, -0.567] -1.041167 (-0.567, 0.0321] -1.722926 (0.0321, 0.724] -1.245856 (0.724, 3.478] -1.240876 (-0.843, -0.228] [-2.58, -0.567] -0.576264 (-0.567, 0.0321] -0.501709 (0.0321, 0.724] -0.522697 (0.724, 3.478] -0.506259 (-0.228, 0.382] [-2.58, -0.567] 0.175768 (-0.567, 0.0321] 0.214353 (0.0321, 0.724] 0.113650 (0.724, 3.478] -0.013758 (0.382, 2.662] [-2.58, -0.567] 0.983807 (-0.567, 0.0321] 1.214640 (0.0321, 0.724] 0.808608 (0.724, 3.478] 1.515334 Name: x, dtype: float64 #Now let's make a two way table and here is the problem: HTML(g.unstack().to_html()) </code></pre> This shows: <pre class="prettyprint"><code>y (-0.567, 0.0321] (0.0321, 0.724] (0.724, 3.478] [-2.58, -0.567] x (-0.228, 0.382] 0.214353 0.113650 -0.013758 0.175768 (-0.843, -0.228] -0.501709 -0.522697 -0.506259 -0.576264 (0.382, 2.662] 1.214640 0.808608 1.515334 0.983807 [-2.315, -0.843] -1.722926 -1.245856 -1.240876 -1.041167 </code></pre> Note how the headers are no longer sorted. I am wondering what is a good way to solve this problem so as to making interactive work easy. To further track down where the problem is, run the following: <pre class="prettyprint"><code>g.unstack().columns </code></pre> It gives me this: Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object) Now compare this with B.levels: <pre class="prettyprint"><code>B.levels Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object) </code></pre> Obviously the order originally in Factor is lost. Now to make the matter even worse, let's do a multi-level cross table: <pre class="prettyprint"><code>g2 = X.groupby([A,B]).agg('mean') g3 = g2.stack().unstack(-2) HTML(g3.to_html()) </code></pre> It shows something like: <pre class="prettyprint"><code>y (-0.567, 0.0321] (0.0321, 0.724] (0.724, 3.478] x (-0.228, 0.382] x 0.214353 0.113650 -0.013758 y -0.293465 0.321836 1.180369 (-0.843, -0.228] x -0.501709 -0.522697 -0.506259 y -0.204811 0.324571 1.167005 (0.382, 2.662] x 1.214640 0.808608 1.515334 y -0.195446 0.161198 1.074532 [-2.315, -0.843] x -1.722926 -1.245856 -1.240876 y -0.392896 0.335471 1.730513 </code></pre> With both the row and column labels sorted incorrectly. Thanks.

This seems like a little bit of a hack, but here goes: <pre class="prettyprint"><code>In [11]: g_unstacked = g.unstack() In [12]: g_unstacked Out[12]: y (-0.565, 0.12] (0.12, 0.791] (0.791, 2.57] [-2.177, -0.565] x (-0.068, 0.625] 0.389408 0.267252 0.283344 0.258337 (-0.892, -0.068] -0.121413 -0.471889 -0.448977 -0.462180 (0.625, 1.639] 0.987372 1.006496 0.830710 1.202158 [-3.124, -0.892] -1.513954 -1.482813 -1.394198 -1.756679 </code></pre> Making use the fact that <code>unique</code> preserves order* (grabbing the unique first terms in from the index of g): <pre class="prettyprint"><code>In [13]: g.index.get_level_values(0).unique() Out[13]: array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]', '(0.625, 1.639]'], dtype=object) </code></pre> As you can see, these are in the correct order. Now you can <code>reindex</code> by this: <pre class="prettyprint"><code>In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique()) Out[14]: y (-0.565, 0.12] (0.12, 0.791] (0.791, 2.57] [-2.177, -0.565] [-3.124, -0.892] -1.513954 -1.482813 -1.394198 -1.756679 (-0.892, -0.068] -0.121413 -0.471889 -0.448977 -0.462180 (-0.068, 0.625] 0.389408 0.267252 0.283344 0.258337 (0.625, 1.639] 0.987372 1.006496 0.830710 1.202158 </code></pre> Which is now in the correct order. Update (I missed that the columns were also not in order). You can use the same trick for the columns (you'll have to chain these operations): <pre class="prettyprint"><code>In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1) </code></pre> * this is the reason Series unique is significantly faster than <code>np.unique</code>.

Pandas DataFrame.unstack() Changes Order of Row and Column Headers

Tags:

python

pandas

I have run into the following problem of sorting the row and column headers.

Here is how to reproduce this:

X =pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100)))
A=pd.qcut(X['x'], [0,0.25,0.5,0.75,1.0]) #create a factor
B=pd.qcut(X['y'], [0,0.25,0.5,0.75,1.0]) # create another factor

g = X.groupby([A,B])['x'].mean() #do a two-way bucketing


print g 

#this gives the following and so far so good

x                 y               
[-2.315, -0.843]  [-2.58, -0.567]    -1.041167
                  (-0.567, 0.0321]   -1.722926
                  (0.0321, 0.724]    -1.245856
                  (0.724, 3.478]     -1.240876
(-0.843, -0.228]  [-2.58, -0.567]    -0.576264
                  (-0.567, 0.0321]   -0.501709
                  (0.0321, 0.724]    -0.522697
                  (0.724, 3.478]     -0.506259
(-0.228, 0.382]   [-2.58, -0.567]     0.175768
                  (-0.567, 0.0321]    0.214353
                  (0.0321, 0.724]     0.113650
                  (0.724, 3.478]     -0.013758
(0.382, 2.662]    [-2.58, -0.567]     0.983807
                  (-0.567, 0.0321]    1.214640
                  (0.0321, 0.724]     0.808608
                  (0.724, 3.478]      1.515334
Name: x, dtype: float64

#Now let's make a two way table and here is the problem:

HTML(g.unstack().to_html())

This shows:

y                 (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]  [-2.58, -0.567]
x                                                                                   
(-0.228, 0.382]           0.214353         0.113650       -0.013758         0.175768
(-0.843, -0.228]         -0.501709        -0.522697       -0.506259        -0.576264
(0.382, 2.662]            1.214640         0.808608        1.515334         0.983807
[-2.315, -0.843]         -1.722926        -1.245856       -1.240876        -1.041167

Note how the headers are no longer sorted. I am wondering what is a good way to solve this problem so as to making interactive work easy.

To further track down where the problem is, run the following:

g.unstack().columns

It gives me this: Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object)

Now compare this with B.levels:

B.levels
Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object)

Obviously the order originally in Factor is lost.

Now to make the matter even worse, let's do a multi-level cross table:

g2 = X.groupby([A,B]).agg('mean')
g3 = g2.stack().unstack(-2)
HTML(g3.to_html())

It shows something like:

y                   (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]  
x                                                                       
(-0.228, 0.382]  x          0.214353         0.113650       -0.013758   
                 y         -0.293465         0.321836        1.180369   
(-0.843, -0.228] x         -0.501709        -0.522697       -0.506259   
                 y         -0.204811         0.324571        1.167005   
(0.382, 2.662]   x          1.214640         0.808608        1.515334   
                 y         -0.195446         0.161198        1.074532   
[-2.315, -0.843] x         -1.722926        -1.245856       -1.240876   
                 y         -0.392896         0.335471        1.730513

With both the row and column labels sorted incorrectly.

Thanks.

928

asked Jun 17 '13 20:06

Tom Bennett

1 Answers

This seems like a little bit of a hack, but here goes:

In [11]: g_unstacked = g.unstack()

In [12]: g_unstacked
Out[12]:
y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
x
(-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
(-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
(0.625, 1.639]          0.987372       1.006496       0.830710          1.202158
[-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679

Making use the fact that unique preserves order* (grabbing the unique first terms in from the index of g):

In [13]: g.index.get_level_values(0).unique()
Out[13]:
array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]',
       '(0.625, 1.639]'], dtype=object)

As you can see, these are in the correct order.

Now you can reindex by this:

In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique())
Out[14]:
y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
[-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679
(-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
(-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
(0.625, 1.639]          0.987372       1.006496       0.830710          1.202158

Which is now in the correct order.

Update (I missed that the columns were also not in order).
You can use the same trick for the columns (you'll have to chain these operations):

In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1)

* this is the reason Series unique is significantly faster than np.unique.

151

answered Sep 19 '22 05:09

Andy Hayden

Related questions
                            
                                Highlight text in a PDF with Python [closed]
                            
                                Programmatically defining a class: type vs types.new_class
                            
                                Differences between generator comprehension expressions
                            
                                Adding packages to Python "embedded" installation for Windows
                            
                                numpy.unique gives wrong output for list of sets
                            
                                Pylons error - 'MySQL server has gone away'
                            
                                In Django, how do you retrieve data from extra fields on many-to-many relationships without an explicit query for it?
                            
                                ctypes for static libraries?
                            
                                Is there a way to set metaclass after the class definition?
                            
                                Checking if property is settable/deletable
                            
                                efficient function to retrieve a queryset of ancestors of an mptt queryset
                            
                                Python equivalent of unix cksum function
                            
                                Python AppIndicator bindings -> howto check if the menu is open?
                            
                                Why does pickle protocol 2 let me serialise an open file object?
                            
                                Python performance vs PHP [closed]
                            
                                Making custom containers work with **kwargs (how does Python expand the args?)
                            
                                Cartesian product of large iterators (itertools)
                            
                                Learning and using augmented Bayes classifiers in python
                            
                                Python tools for out-of-core computation/data mining
                            
                                How to duplicate rows in pandas, based on items in a list [duplicate]

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With