Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas DataFrame.unstack() Changes Order of Row and Column Headers

Tags:

python

pandas

I have run into the following problem of sorting the row and column headers.

Here is how to reproduce this:

X =pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100)))
A=pd.qcut(X['x'], [0,0.25,0.5,0.75,1.0]) #create a factor
B=pd.qcut(X['y'], [0,0.25,0.5,0.75,1.0]) # create another factor

g = X.groupby([A,B])['x'].mean() #do a two-way bucketing


print g 

#this gives the following and so far so good

x                 y               
[-2.315, -0.843]  [-2.58, -0.567]    -1.041167
                  (-0.567, 0.0321]   -1.722926
                  (0.0321, 0.724]    -1.245856
                  (0.724, 3.478]     -1.240876
(-0.843, -0.228]  [-2.58, -0.567]    -0.576264
                  (-0.567, 0.0321]   -0.501709
                  (0.0321, 0.724]    -0.522697
                  (0.724, 3.478]     -0.506259
(-0.228, 0.382]   [-2.58, -0.567]     0.175768
                  (-0.567, 0.0321]    0.214353
                  (0.0321, 0.724]     0.113650
                  (0.724, 3.478]     -0.013758
(0.382, 2.662]    [-2.58, -0.567]     0.983807
                  (-0.567, 0.0321]    1.214640
                  (0.0321, 0.724]     0.808608
                  (0.724, 3.478]      1.515334
Name: x, dtype: float64

#Now let's make a two way table and here is the problem:

HTML(g.unstack().to_html())

This shows:

y                 (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]  [-2.58, -0.567]
x                                                                                   
(-0.228, 0.382]           0.214353         0.113650       -0.013758         0.175768
(-0.843, -0.228]         -0.501709        -0.522697       -0.506259        -0.576264
(0.382, 2.662]            1.214640         0.808608        1.515334         0.983807
[-2.315, -0.843]         -1.722926        -1.245856       -1.240876        -1.041167

Note how the headers are no longer sorted. I am wondering what is a good way to solve this problem so as to making interactive work easy.

To further track down where the problem is, run the following:

g.unstack().columns

It gives me this: Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object)

Now compare this with B.levels:

B.levels
Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object)

Obviously the order originally in Factor is lost.

Now to make the matter even worse, let's do a multi-level cross table:

g2 = X.groupby([A,B]).agg('mean')
g3 = g2.stack().unstack(-2)
HTML(g3.to_html())

It shows something like:

y                   (-0.567, 0.0321]  (0.0321, 0.724]  (0.724, 3.478]  
x                                                                       
(-0.228, 0.382]  x          0.214353         0.113650       -0.013758   
                 y         -0.293465         0.321836        1.180369   
(-0.843, -0.228] x         -0.501709        -0.522697       -0.506259   
                 y         -0.204811         0.324571        1.167005   
(0.382, 2.662]   x          1.214640         0.808608        1.515334   
                 y         -0.195446         0.161198        1.074532   
[-2.315, -0.843] x         -1.722926        -1.245856       -1.240876   
                 y         -0.392896         0.335471        1.730513  

With both the row and column labels sorted incorrectly.

Thanks.

like image 928
Tom Bennett Avatar asked Jun 17 '13 20:06

Tom Bennett


People also ask

What is the impact of calling the unstack () method on a DataFrame?

By calling unstack() , it unstack the inner-most index onto column.

What does unstack () do in pandas?

Pandas Unstack is a function that pivots the level of the indexed columns in a stacked dataframe.

How do I rearrange the order of columns in pandas?

Reorder Columns using Pandas . Another way to reorder columns is to use the Pandas . reindex() method. This allows you to pass in the columns= parameter to pass in the order of columns that you want to use.

Does column order matter in DataFrame?

No, it does not work for missing values. Then you start doing dropna or fillna on various columns that are not matching.


1 Answers

This seems like a little bit of a hack, but here goes:

In [11]: g_unstacked = g.unstack()

In [12]: g_unstacked
Out[12]:
y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
x
(-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
(-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
(0.625, 1.639]          0.987372       1.006496       0.830710          1.202158
[-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679

Making use the fact that unique preserves order* (grabbing the unique first terms in from the index of g):

In [13]: g.index.get_level_values(0).unique()
Out[13]:
array(['[-3.124, -0.892]', '(-0.892, -0.068]', '(-0.068, 0.625]',
       '(0.625, 1.639]'], dtype=object)

As you can see, these are in the correct order.

Now you can reindex by this:

In [14]: g_unstacked.reindex(g.index.get_level_values(0).unique())
Out[14]:
y                 (-0.565, 0.12]  (0.12, 0.791]  (0.791, 2.57]  [-2.177, -0.565]
[-3.124, -0.892]       -1.513954      -1.482813      -1.394198         -1.756679
(-0.892, -0.068]       -0.121413      -0.471889      -0.448977         -0.462180
(-0.068, 0.625]         0.389408       0.267252       0.283344          0.258337
(0.625, 1.639]          0.987372       1.006496       0.830710          1.202158

Which is now in the correct order.

Update (I missed that the columns were also not in order).
You can use the same trick for the columns (you'll have to chain these operations):

In [15]: g_unstacked.reindex_axis(g.index.get_level_values(1).unique(), axis=1)

* this is the reason Series unique is significantly faster than np.unique.

like image 151
Andy Hayden Avatar answered Sep 19 '22 05:09

Andy Hayden